[BUG] mender-update hangs after download-resume gives up, never reports Failure, never returns to poll loop (5.0.3)

tck · June 26, 2026, 2:15pm

Summary

On mender 5.0.3 (C++ client, `mender-updated`), a network outage **during the artifact download phase** leads to a permanent wedge. After the HTTP download resumer exhausts its retries and raises `DownloadResumerError`, the client **hangs**: it does not report the deployment as `Failure`, does not return to the poll loop, and stops emitting any log output. The device looks “stuck updating” indefinitely and does **not** self-recover even after the network returns. Only `systemctl restart mender-updated` (or a reboot) clears it.

This is related to, but distinct from, the topic "Client does not resume download and hangs indefinitely when server connection is severed". In that report the TCP socket stays “established” so the read never fails and resume never triggers. In our case the read *does* fail, resume *does* trigger and retries normally, **then gives up**, and the hang happens *after* the give-up.

Environment

Client: mender 5.0.3 (`Running Mender client 5.0.3`), C++ client (`mender-update` / `mender-updated`)
Platform: embedded Linux (Yocto-based), eth0
Config: `UpdatePollIntervalSeconds=30`, `RetryPollIntervalSeconds=300`, `RetryDownloadCount` left at default (10)

Steps to reproduce

Start a deployment and let it enter the download phase (`Download_Enter`).
~10 s to 30 s after the download starts, cut the network so that DNS resolution and connections fail (physically unplug eth0). A real link-down makes DNS fail fast every 60 s.
Leave the network down long enough to exhaust the resume retries.
Restore the network.

Observed behaviour

Resume retries log normally (`name="http_resumer:client"`), then: `Giving up on resuming the download: Tried maximum number of times: Exponential backoff`
**After that line, `mender-updated` logs nothing**: in our test, ~66 min of total silence, including ~3 min after the link was restored.
The service stays `active (running)` but is wedged: no Failure report to the server, no further polling, no `Sync_Enter`. No self-recovery when the network returns.
`systemctl restart mender-updated` recovers it **instantly**: it immediately sends a status update, logs `Deployment … finished with status: Failure`, and resumes `No update available` polling.

The fact that a plain restart fixes it (with no on-disk state change) indicates only the **in-memory deployment state machine** is stuck, not any persistent state.

Expected behaviour

When the download resumer gives up (`DownloadResumerError`), the client should follow the same path as any other download failure: transition to ArtifactFailure, report the deployment with `status: Failure`, and return to the Idle/Sync poll loop rather than hanging.
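
A minimal sketch of that expectation, not the client's actual state machine: the state names, `DownloadOutcome`, `NextAfterDownload`, `NextAfterReport` and `Trace` are all hypothetical and only illustrate that every download outcome, including a resume give-up, should map to a next state.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified states and outcomes; not the client's real names.
enum class State { Sync, Download, Install, ReportFailure };
enum class DownloadOutcome { Ok, ReadError, ResumeGaveUp };

// Every download outcome maps to a next state, so none can park in Download.
State NextAfterDownload(DownloadOutcome outcome) {
    switch (outcome) {
    case DownloadOutcome::Ok:
        return State::Install;
    case DownloadOutcome::ReadError:
    case DownloadOutcome::ResumeGaveUp:
        // A resume give-up is treated like any other download failure.
        return State::ReportFailure;
    }
    assert(false && "unhandled download outcome");
    return State::ReportFailure;
}

// After the failure report, the machine returns to the poll loop.
State NextAfterReport() { return State::Sync; }

// States visited from Download onward for a given outcome.
std::vector<State> Trace(DownloadOutcome outcome) {
    std::vector<State> visited{State::Download};
    State next = NextAfterDownload(outcome);
    visited.push_back(next);
    if (next == State::ReportFailure) {
        visited.push_back(NextAfterReport());
    }
    return visited;
}
```

In this model, `Trace(DownloadOutcome::ResumeGaveUp)` visits Download, ReportFailure and Sync in that order, which is the path the observed behaviour never takes.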

Likely location

The resumer itself behaves correctly: `src/common/http/http_resumer.cpp` (`DownloadResumerClient::ScheduleNextResumeRequest()`) raises the `http::DownloadResumerError` once `ExponentialBackoff(chrono::minutes(1), config.retry_download_count)` is exhausted. The defect appears to be in the **caller** (the deployment state machine’s `Download` / `ArtifactDownload` handler) which doesn’t consume that error to drive a state transition, so the deployment dead-ends instead of failing cleanly.
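
To illustrate the kind of defect suspected here, the sketch below contrasts two hypothetical completion handlers. It is not taken from the mender source: `EventQueue`, `Event`, `MakeDroppingHandler` and `MakeForwardingHandler` are invented names, and the assumption is only that the caller receives the resumer's error in a callback and must turn it into a state machine event.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <system_error>

// Hypothetical event names; not the client's actual identifiers.
enum class Event { DownloadDone, DownloadFailed };

// Minimal stand-in for the state machine's pending-event queue.
struct EventQueue {
    std::deque<Event> pending;
    void Post(Event e) { pending.push_back(e); }
    bool Idle() const { return pending.empty(); }
    Event Pop() {
        assert(!pending.empty());
        Event e = pending.front();
        pending.pop_front();
        return e;
    }
};

using ReadHandler = std::function<void(std::error_code)>;

// Suspected shape of the defect: the error is logged but never becomes an event.
ReadHandler MakeDroppingHandler(EventQueue &queue, std::string &log) {
    return [&queue, &log](std::error_code err) {
        if (err) {
            log += "Resume download error: " + err.message() + "\n";
            return; // nothing posted, so the state machine has no next step
        }
        queue.Post(Event::DownloadDone);
    };
}

// Shape that fails cleanly: the same error also posts a failure event.
ReadHandler MakeForwardingHandler(EventQueue &queue, std::string &log) {
    return [&queue, &log](std::error_code err) {
        if (err) {
            log += "Resume download error: " + err.message() + "\n";
            queue.Post(Event::DownloadFailed);
            return;
        }
        queue.Post(Event::DownloadDone);
    };
}
```

With the dropping handler, an error leaves the queue idle after the log line is written, which would look like a running but silent service; with the forwarding handler the same error yields a `DownloadFailed` event that the state machine can act on.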

TheYoctoJester · July 1, 2026, 11:07am

Hi @tck,

I’ve raised this to the devs, and it doesn’t sound like something we already know about. But we’d definitely like to understand and fix the problem, so the obvious first question is: do you know if the problem was introduced recently? Does it already exist in 5.0.2, and is it still present in 5.0.4 or 5.0.5?

Greetz,
Josef

tck · July 2, 2026, 9:51am

Hi @TheYoctoJester,

I reproduced the exact same hang on all three versions: 5.0.2, 5.0.4 and 5.0.5 (Raspberry Pi CM4, Yocto scarthgap, C++ mender-update daemon). Procedure each time: trigger an application-artifact deployment, physically disconnect the network during the download, wait ~15 min, reconnect.

Here is a redacted, warning-only `mender-updated` journal from the 5.0.5 run as a representative example (5.0.2 and 5.0.4 are identical). Line references:

lines 7 to 12: initial device-acceptance / auth-bootstrap phase right after a fresh flash; unrelated to the bug (see the note on L6).
line 13 (08:07:46): the in-flight download read fails after the cut: `http_resumer:reader … The operation timed out: Could not read body`.
lines 14 to 23 (08:08:46 to 08:17:46): exactly 10 resume attempts, one per minute (`http_resumer:client … Host not found`), i.e. `ExponentialBackoff(1min, 10)`.
lines 24 to 25 (08:17:46): `Resume download error: Giving up on resuming the download: Tried maximum number of times: Exponential backoff`.
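
The timing in those lines can be cross-checked with a simplified backoff model. This is a sketch under assumptions, not the client's code: `BackoffSchedule` and `TotalWait` are hypothetical helpers, and the rule that intervals start at one minute and double every three attempts before being capped is assumed rather than confirmed here. With a one-minute cap and 10 tries it gives ten one-minute waits, consistent with 08:07:46 + 10 min = 08:17:46.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Waits before each resume attempt in a capped, count-limited backoff.
// ASSUMPTION: start at 1 minute, double every 3 attempts, cap at max_interval.
std::vector<std::chrono::minutes> BackoffSchedule(
    std::chrono::minutes max_interval, int max_tries) {
    assert(max_tries >= 0);
    std::vector<std::chrono::minutes> waits;
    for (int attempt = 0; attempt < max_tries; ++attempt) {
        std::chrono::minutes interval{1};
        for (int n = attempt; n >= 3; n -= 3) {
            interval *= 2; // one doubling per completed group of 3 attempts
        }
        waits.push_back(std::min(interval, max_interval));
    }
    return waits;
}

// Sum of all waits, i.e. the time from the first failure to the give-up,
// ignoring the duration of the attempts themselves.
std::chrono::minutes TotalWait(const std::vector<std::chrono::minutes> &waits) {
    std::chrono::minutes total{0};
    for (auto wait : waits) {
        total += wait;
    }
    return total;
}
```

Calling `TotalWait(BackoffSchedule(std::chrono::minutes(1), 10))` returns 10 minutes in this model, because the cap keeps every interval at one minute.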

After the give-up, the mender-update daemon stays alive but wedged: State: S, wchan=do_epoll_wait, 2 threads, no open sockets. It never reports the deployment as failed and never returns to the poll loop, even after the network is restored (verified from 15 min up to ~15 h later). Only a service restart or a reboot recovers it.

So the behaviour is not specific to 5.0.3: 5.0.2, 5.0.4 and 5.0.5 all show it.

A redacted info log is also available, showing the lack of any record_id once the process gives up during the outage, even though network connectivity was restored.

TheYoctoJester · July 2, 2026, 10:26am

Thanks for the detailed bug report and analysis! I’ve forwarded everything to the dev team, will get back to you as soon as I have any news!

Greetz,
Josef

TheYoctoJester · July 2, 2026, 4:24pm

Hi @tck,

It’s really only a first stab, but if you have the setup to test it, can you give this PR, or the resulting patch, a spin? "fix: report download failure when resume give-up occurs mid-stream" (Pull Request #1988, mendersoftware/mender on GitHub)

Greetz,
Josef

tck · July 3, 2026, 9:25am

I tested this PR using the exact protocol from my original report: the fix works.

Setup: Raspberry Pi CM4 (Yocto scarthgap image), mender-client 5.0.3 with this PR applied as a patch, self-hosted server, artifact on S3. UpdatePollIntervalSeconds=30, RetryPollIntervalSeconds=300, log level info.

Protocol: trigger a deployment, pull the network cable seconds after the download starts, wait ~20 min, replug.

Summarized log (the full anonymized log is also available):
08:21:00 info Deployment with ID 1eee1487-... started.
08:22:01 [network cable pulled]
08:26:09 info Will try to resume after error ... Read timed out
... (10 resume attempts, 60 s apart)
08:36:09 error Resume download error: Giving up on resuming the download: Tried maximum number of times: Exponential backoff
08:36:09 error ... Truncated tar archive detected while reading data
08:36:09 info Sending status update to server
... (status push retried with backoff while network still down)
08:42:19 [cable replugged]
08:43:09 info Deployment with ID 1eee1487-... finished with status: Failure
08:43:10 info No update available ← poll loop resumed

The status (fail) is then visible in the mender web interface.

Note: If the network hasn’t recovered by the end of the status push backoff period (45+ minutes?), we would probably see the target give up without notifying the server, and the update might be restarted whenever network connectivity is restored (this sounds like reasonable and acceptable behaviour). I will try this at a later date.

tck · July 3, 2026, 12:12pm

I’ve had time to test the case where the outage lasts past the status push backoff period, and I can confirm that it works.
If network access is restored after ~50 minutes:

The server isn’t notified of the failure (it still shows the deployment as downloading)
The update eventually restarts (triggered by the server)
The update does get installed and is properly reported as successful on the server (with the start time of the first attempt, before the network disconnection, and the end time of the second attempt's completion, about an hour later)

This completely validates the PR.
Thanks for your availability!

TheYoctoJester · July 8, 2026, 11:46am

Hi @tck,

To follow up, we are tracking the issue in Jira and will merge a fix soon.

Greetz
Josef

