[Self-hosted Mender 1.7] Deployment Stalls at 99%

We’re having problems completing deployments on our self-hosted Mender 1.7.0 server: deployments stall at 99%. We’re reasonably confident our clients are running the 1.7.0 client. The clients are built with Yocto and run a BusyBox-based image. We’re using an external Minio service, following the Amazon S3 integration as outlined in the docs.

The client is reporting a 500 error from the server, but we’re having difficulty finding the cause. Has anybody else experienced this error? How can we enable debug logging (and in which containers) to troubleshoot this issue?

We’ve tried using “journalctl -xe -f -u mender-server.service” to view the service output during a stalled deployment. We did not see anything about a 500 error. We see log records like this before the client stalls:

Apr 11 12:35:52 [redacted-mender-server-hostname] run[13744]: mender-deployments_1    | time="2019-04-11T19:35:52Z" level=info msg="204 3620μs PUT /api/devices/v1/deployments/device/deployments/f5f44053-cb26-4f54-855e-915fc3b612c6/status HTTP/1.0 - Go-http-client/1.1" device_id=5caf939ff64d0000019816df file=middleware.go func="accesslog.(*AccessLogMiddleware).MiddlewareFunc.func1" line=60 request_id=6cb5ab0d-9181-4b14-aed1-febf3ced4b15
Apr 11 12:35:52 [redacted-mender-server-hostname] run[13744]: mender-api-gateway_1    | 172.24.10.2 - - [11/Apr/2019:19:35:52 +0000] "PUT /api/devices/v1/deployments/device/deployments/f5f44053-cb26-4f54-855e-915fc3b612c6/status HTTP/1.1" 204 0 "-" "Go-http-client/1.1" "[redacted-IP-addy]" request ID "6cb5ab0d-9181-4b14-aed1-febf3ced4b15" 0.006

Unfortunately I don’t have a copy of the deployment log from the mender client. This is the output from systemctl status mender on our client when it’s stalled at 99% (on the server):

● mender.service - Mender OTA update service
       Loaded: loaded (/lib/systemd/system/mender.service; enabled; vendor preset: enabled)
       Active: active (running) since Tue 2018-09-04 10:27:07 PDT; 7 months 5 days ago
      Process: 240 ExecStartPre=/bin/ln -sf /etc/mender/tenant.conf /var/lib/mender/authtentoken (code=exited, status=0/SUCCESS)
      Process: 221 ExecStartPre=/bin/mkdir -p -m 0700 /data/mender (code=exited, status=0/SUCCESS)
     Main PID: 245 (mender)
       CGroup: /system.slice/mender.service
               └─245 /usr/bin/mender -daemon

Apr 11 09:51:32 agent mender[245]: time="2019-04-11T09:51:32-07:00" level=info msg="State transition: update-retry-report [ArtifactCommit] -> update-status-report [ArtifactCommit]" module=mender
Apr 11 09:51:33 agent mender[245]: time="2019-04-11T09:51:33-07:00" level=error msg="got unexpected HTTP status when reporting status: 500" module="client_status"
Apr 11 09:51:33 agent mender[245]: time="2019-04-11T09:51:33-07:00" level=error msg="error reporting update status: (request_id: ): reporting status failed, bad status 500 server error message: internal error" module=mender
Apr 11 09:51:33 agent mender[245]: time="2019-04-11T09:51:33-07:00" level=error msg="failed to send status to server: transient error: (request_id: ): reporting status failed, bad status 500 server error message: internal error" module=state
Apr 11 09:51:33 agent mender[245]: time="2019-04-11T09:51:33-07:00" level=info msg="State transition: update-status-report [ArtifactCommit] -> update-retry-report [none]" module=mender

AFAIK, the server waits for the client to download the artifact and reboot so it can verify the update. Once the client reboots into the new partition and re-establishes its connection with the server, the server considers the update successful. Otherwise, it simply keeps waiting for the client to reconnect (while showing 99%).
Are you sure the network came back up on the client side after the reboot?
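
For reference, the “connection” here is just the client’s status report, and you can exercise the same endpoint by hand from the device to rule out network problems. A rough curl sketch (DEVICE_JWT and DEPLOYMENT_ID are placeholders; the path is the same one visible in your gateway logs):

# Report deployment status manually, mimicking the client; -w prints just the HTTP status code.
curl -s -o /dev/null -w '%{http_code}\n' -X PUT \
  -H "Authorization: Bearer ${DEVICE_JWT}" \
  -H "Content-Type: application/json" \
  -d '{"status": "success"}' \
  "https://mender.example.com/api/devices/v1/deployments/device/deployments/${DEPLOYMENT_ID}/status"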

I’m not sure if the reboot has happened on the device at 99%. The progress indicator says rebooting from ~92% to 99%, and we see PUTs with 2xx responses in the (systemctl and docker) logs during this interval, but I’m not sure where in that process the device actually reboots. Does it reboot at 92% or at 99%? Regardless, the deployments get stuck; we have one such deployment stuck for 5.5 hrs. Short of a fix, what we want to know is how to troubleshoot the 500 error. How do we enable debug logging on the docker containers?

To get container logs, see:

https://docs.mender.io/1.7/troubleshooting/mender-server#container-logs
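
In practice that boils down to plain docker commands; the container names vary with your compose project name, so list them first:

sudo docker ps --format '{{.Names}}'   # find the exact container names on your host

# Tail the two services most relevant to the failing status PUT (names assumed; adjust to yours)
sudo docker logs -f --tail 200 menderproduction_mender-deployments_1
sudo docker logs -f --tail 200 menderproduction_mender-api-gateway_1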

Hi @ajithpv,

I work with @fbeachler and am involved in this troubleshooting effort. I can confirm that my test device did completely reboot onto the new image. It then went into a loop of trying to report its update status back to the server to mark the deployment as complete. Here is a log from the device when it is in this state:

Apr 11 15:05:57 boomtown-agent mender[230]: level=error msg="failed to report data success: reporting status failed, bad status 404" module=state
Apr 11 15:10:57 boomtown-agent mender[230]: time="2019-04-11T15:10:57-07:00" level=info msg="attempting to report data of deployment [4340f1ee-d1ea-4dfe-8e08-8cc510d55e0f] to the backend; deployment status [success], try 2" module=state
Apr 11 15:10:57 boomtown-agent mender[230]: time="2019-04-11T15:10:57-07:00" level=error msg="got unexpected HTTP status when reporting status: 404" module="client_status"
Apr 11 15:10:57 boomtown-agent mender[230]: time="2019-04-11T15:10:57-07:00" level=error msg="error reporting update status: reporting status failed, bad status 404" module=mender
Apr 11 15:10:57 boomtown-agent mender[230]: time="2019-04-11T15:10:57-07:00" level=error msg="failed to report data success: reporting status failed, bad status 404" module=state

And here is what the corresponding request looked like in the mender-api-gateway log:

Apr 11 15:05:57 oa.uat.sjc.boomtown.cloud run[5737]: mender-api-gateway_1 | 172.24.10.2 - - [11/Apr/2019:22:05:57 +0000] "PUT //api/devices/v1/deployments/device/deployments/4340f1ee-d1ea-4dfe-8e08-8cc510d55e0f/status HTTP/1.1" 404 59 "-" "Go-http-client/1.1" "74.123.252.3" request ID "533d8e73-4c62-4081-8c3c-bf05582c1d5e" 0.003

Notice that the double // in the URL seems to be what triggers the 404.
Any clues?

Thanks,
Daniel

What is the content of /etc/mender/mender.conf when it fails to report status?

Same as it was before:

{
  "ClientProtocol": "http",
  "HttpsClient": {
    "Certificate": "",
    "Key": ""
  },
  "RootfsPartA": "/dev/mmcblk0p2",
  "RootfsPartB": "/dev/mmcblk0p3",
  "UpdatePollIntervalSeconds": 1800,
  "InventoryPollIntervalSeconds": 1800,
  "RetryPollIntervalSeconds": 300,
  "ServerURL": "https://mender.uat.thinkrelay.com/",
  "ServerCertificate": "/etc/mender/server.crt"
}

Our test case involves two builds where only the artifact name differs. We’re trying to eliminate variables by focusing on the server change alone (i.e. external Minio vs. the embedded one).

I do notice the trailing / in your ServerURL; since you pointed out earlier that the // seems to be causing the 404 error, this is where it is coming from.
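
As a quick illustration of what the client ends up requesting (a sketch; plain concatenation of the two parts):

SERVER_URL="https://mender.uat.thinkrelay.com/"    # trailing slash, as in your mender.conf
API_PATH="/api/devices/v1/deployments/device/deployments/<id>/status"
echo "${SERVER_URL}${API_PATH}"     # -> https://...//api/...  (the 404 case from your gateway log)
echo "${SERVER_URL%/}${API_PATH}"   # -> https://.../api/...   (what the working requests look like)

Dropping the trailing slash from ServerURL should make both forms identical.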

Thanks @mirzak. Here’s a recap of our steps:

  • Updated mender-server from 1.5.1 to 1.7.0.
  • Reconfigured prod.yml to use an external S3 server (a separate Minio instance); see the sketch after this list.
  • Built a new artifact in our poky build environment to increment the artifact name – no other changes were made in this build from the previously running image.
  • Uploaded the artifact to our mender-server to test the external S3 server.
  • Re-pointed my device (the one I posted logs for above) to the mender-server and authorized it.
  • Created a deployment of the new artifact to my device to test the external S3 server in a deployment scenario.
  • Verified my device rebooted itself onto the updated build/partition.
  • Ran sudo systemctl status mender on the device and verified it was trying to report its update status back to the mender-server, with the PUT // request occurring at a couple of points and getting back the 404.
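
For completeness, the prod.yml change in the second step was along these lines (a sketch only; environment variable names as in the Mender S3 integration docs, values are placeholders rather than our real endpoint and credentials):

    mender-deployments:
        environment:
            DEPLOYMENTS_AWS_AUTH_KEY: <minio-access-key>         # placeholder
            DEPLOYMENTS_AWS_AUTH_SECRET: <minio-secret-key>      # placeholder
            DEPLOYMENTS_AWS_BUCKET: mender-artifact-storage
            DEPLOYMENTS_AWS_URI: https://minio.example.com:9000  # placeholder for the external Minio endpoint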

My gut feeling is that the S3 change caused this behavior. The reason I believe this is that before we rolled out the S3 change, we first tested a deployment to our client on the stock 1.7.0 mender-server with the embedded Minio + storage proxy, and it did not exhibit this problem. Only after changing to the external S3 did we see this issue.

Right now I am in the process of reverting the mender-server to the embedded Minio + storage proxy to verify that it works. I will post back to confirm or refute my theory. In the meantime, if you have any further suggestions, please share them. We’re growing weary after losing two full developer-weeks to this situation. :frowning:

Thanks,
Daniel

Can you share more of the server-side logs?

Edit: not just the errors; the lines before and after as well.

The server-side configuration changes you made to switch to your external Minio instance would be helpful as well.

@mirzak thanks to you and everyone else for trying to help us here. I’m happy to report the issue has been resolved and everything is running great on mender-server 1.7.0 with external S3 artifacts (i.e. a standalone Minio instance). As you suggested, the issue was the trailing / on the URL. Interestingly, only this one PUT request was affected by it; all other requests showed only a single / in their URLs.
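
Concretely, for anyone landing here later, the fix was removing the trailing slash from ServerURL in /etc/mender/mender.conf:

  "ServerURL": "https://mender.uat.thinkrelay.com",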

FWIW, your documentation contains conflicting information (see https://docs.mender.io/1.7/client-configuration/configuration-file/configuration-options#servers): the ServerURL parameter is documented as including a trailing slash, yet the Servers parameter (array form) does not show one.

Thanks again to all for helping us. Problem resolved.

Best,
Daniel

Thanks @boomtown-dd for the feedback.

I have created a bug for this for further investigation,

https://tracker.mender.io/browse/MEN-2464