Summary
On mender 5.0.3 (C++ client, `mender-updated`), a network outage **during the artifact download phase** leads to a permanent wedge. After the HTTP download resumer exhausts its retries and raises `DownloadResumerError`, the client **hangs**: it does not report the deployment as `Failure`, does not return to the poll loop, and stops emitting any log output. The device looks “stuck updating” indefinitely and does **not** self-recover even after the network returns. Only `systemctl restart mender-updated` (or a reboot) clears it.
This is related to, but distinct from, the existing report "Client does not resume download and hangs indefinitely when server connection is severed". In that report the TCP socket stays “established”, so the read never fails and resume never triggers. In our case the read *does* fail, resume *does* trigger and retries normally, **then gives up**, and the hang happens *after* the give-up.
Environment
- Client: mender 5.0.3 (`Running Mender client 5.0.3`), C++ client (`mender-update` / `mender-updated`)
- Platform: embedded Linux (Yocto-based), eth0
- Config: `UpdatePollIntervalSeconds=30`, `RetryPollIntervalSeconds=300`, `RetryDownloadCount` left at default (10)
Steps to reproduce
- Start a deployment and let it enter the download phase (`Download_Enter`).
- ~10 s to 30 s after download starts, cut the network so DNS/connection fails (physically unplug eth0). A real link-down makes DNS fast-fail every 60 s.
- Leave the network down long enough to exhaust the resume retries (`RetryDownloadCount`, default 10).
- Restore the network.
Observed behaviour
- Resume retries log normally (`name="http_resumer:client"`), then: `Giving up on resuming the download: Tried maximum number of times: Exponential backoff`
- **After that line, `mender-updated` logs nothing.** In our test this meant ~66 min of total silence, including ~3 min after the link was restored.
- The service stays `active (running)` but is wedged: no Failure report to the server, no further polling, no `Sync_Enter`. No self-recovery when the network returns.
- `systemctl restart mender-updated` recovers it **instantly**: it immediately sends a status update, logs `Deployment … finished with status: Failure`, and resumes `No update available` polling.
The fact that a plain restart fixes it (with no on-disk state change) indicates only the **in-memory deployment state machine** is stuck, not any persistent state.
Expected behaviour
When the download resumer gives up (`DownloadResumerError`), the client should follow the same path as any other download failure: transition to ArtifactFailure, report the deployment with `status: Failure`, and return to the Idle/Sync poll loop rather than hanging.
Likely location
The resumer itself behaves correctly: `src/common/http/http_resumer.cpp` (`DownloadResumerClient::ScheduleNextResumeRequest()`) raises `http::DownloadResumerError` once `ExponentialBackoff(chrono::minutes(1), config.retry_download_count)` is exhausted. The defect appears to be in the **caller** (the deployment state machine’s `Download` / `ArtifactDownload` handler), which does not seem to consume that error to drive a state transition, so the deployment dead-ends instead of failing cleanly.
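To make the suspected failure mode concrete, here is a minimal, self-contained C++17 sketch. It is not Mender's code: `DeploymentStateMachine`, `Event`, `DownloadError`, and both `OnDownloadDone...` handlers are hypothetical names invented for illustration, and the real client has many more states. It only shows how a completion handler that sees an error but posts no event leaves an event-driven state machine parked in its download state, and how posting a failure event un-wedges it.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified model; not taken from the Mender sources.
enum class State { Idle, Download };
enum class Event { DeploymentStarted, DownloadSucceeded, DownloadFailed };

struct DownloadError {
    std::string message;
};

class DeploymentStateMachine {
public:
    State state() const { return state_; }
    const std::vector<std::string> &reports() const { return reports_; }

    // The machine only moves when an event is posted to it.
    void Post(Event event) {
        switch (state_) {
        case State::Idle:
            if (event == Event::DeploymentStarted) {
                state_ = State::Download;
            }
            break;
        case State::Download:
            if (event == Event::DownloadFailed) {
                reports_.push_back("Failure"); // stands in for the status report
                state_ = State::Idle;          // back to the poll loop
            } else if (event == Event::DownloadSucceeded) {
                reports_.push_back("Success"); // install states omitted here
                state_ = State::Idle;
            }
            break;
        }
    }

private:
    State state_ = State::Idle;
    std::vector<std::string> reports_;
};

// Hypothesised defect: the error is observed, but no event is posted.
void OnDownloadDoneDroppingError(DeploymentStateMachine &sm,
                                 const std::optional<DownloadError> &err) {
    if (err) {
        return; // nothing drives the machine any more
    }
    sm.Post(Event::DownloadSucceeded);
}

// Expected shape: every outcome, including give-up, becomes an event.
void OnDownloadDone(DeploymentStateMachine &sm,
                    const std::optional<DownloadError> &err) {
    sm.Post(err ? Event::DownloadFailed : Event::DownloadSucceeded);
}

// Walks both handlers through the give-up path and checks the outcome.
void DemoWedgeVersusCleanFailure() {
    const DownloadError give_up{"Giving up on resuming the download"};

    DeploymentStateMachine wedged;
    wedged.Post(Event::DeploymentStarted);
    OnDownloadDoneDroppingError(wedged, give_up);
    assert(wedged.state() == State::Download); // stuck, nothing reported
    assert(wedged.reports().empty());

    DeploymentStateMachine healthy;
    healthy.Post(Event::DeploymentStarted);
    OnDownloadDone(healthy, give_up);
    assert(healthy.state() == State::Idle);
    assert(healthy.reports().size() == 1);
}
```

Calling `DemoWedgeVersusCleanFailure()` exercises both paths. The "wedged" instance matches the symptoms described above (still in the download state, no Failure report, no further activity), which is why we suspect a dropped error in the caller rather than a problem in the resumer.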