Hello again.
Thanks for an update.
I do agree that there was a slight start time difference between the the first round of failed deployments (15:56:34) and a second round of successful deployments(16:17:49). Time difference is approx. 20 minutes, with failed one starting earlier. However, on such a huge span of time, this will be equal to only 20 extra retry later
requests, while the overall count of retry later
requests is already over 500.
I have investigated the exact number of the amount of retry later
requests:
- Failed deployment device requested
retry later
593 times - Successful deployment devices requested
retry later
:- 573 times for device 1
- 577 times for device 2
From this round of testing, it appears that the loop counter should not be involved. Otherwise, all 3 of them would have failed.
Now onto last night deployment test (3rd round). We have put 3 devices to a test:
- One of them was put into a scenario where
retry later
will NOT be requested - the update will be applied on boot up. - Two other devices were put into a scenario where
retry later
will be requested - the update will be applied at 2am local time.
And here are the findings:
- The device that had a deployment applied on boot up (no retry later mechanism involved) completed successfully.
- The two devices that were put into a scenario to update overnight failed.
For two failed devices, here are the statistics:
- 1st device requested retry later 564 times
- 2nd device requested retry later 563 times
I have separated logical points within the logs.
1st device log
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-15T01:59:51Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-15T01:59:51Z"}
{"level":"info","message":"Output (stdout) from command \"/var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\": <redacted>-<redacted>-app: signalled","timestamp":"2024-02-15T02:00:53Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nUpdate window reached.\n<redacted> App shutdown signal file NOT found.\nSending SIGUSR1 to <redacted> <redacted> app.\nSleeping 15 seconds before checking for <redacted> <redacted> App status again within the same script execution.\n<redacted> App shutdown signal file found.\nRemoving <redacted> App termianted and termination pending file.\nContinuing update process.\n\n---------- end of script output","timestamp":"2024-02-15T02:01:08Z"}
{"level":"info","message":"State transition: update-install [ArtifactInstall] -\u003e mender-update-control-refresh-maps [none]","timestamp":"2024-02-15T02:01:12Z"}
{"level":"info","message":"State transition: mender-update-control-refresh-maps [none] -\u003e mender-update-control [none]","timestamp":"2024-02-15T02:01:12Z"}
{"level":"info","message":"State transition: mender-update-control [none] -\u003e update-commit [ArtifactCommit_Enter]","timestamp":"2024-02-15T02:01:12Z"}
{"level":"error","message":"Could not write state data to persistent storage: State data stored and retrieved maximum number of times","timestamp":"2024-02-15T02:01:12Z"}
{"level":"error","message":"fatal error: State data stored and retrieved maximum number of times","timestamp":"2024-02-15T02:01:12Z"}
{"level":"error","message":"State transition loop detected in state update-commit: Forcefully aborting update. The system is likely to be in an inconsistent state after this.","timestamp":"2024-02-15T02:01:12Z"}
{"level":"info","message":"State transition: update-commit [ArtifactCommit_Enter] -\u003e update-status-report [none]","timestamp":"2024-02-15T02:01:12Z"}
{"level":"info","message":"Running Mender client version: 3.5.2","timestamp":"2024-02-15T02:01:12Z"}
{"level":"info","message":"State transition: update-fetch [Download_Enter] -\u003e update-store [Download_Enter]","timestamp":"2024-02-15T02:01:13Z"}
{"level":"info","message":"Installer: authenticated digital signature of artifact","timestamp":"2024-02-15T02:01:13Z"}
{"level":"info","message":"State transition: update-store [Download_Enter] -\u003e update-after-store [Download_Leave]","timestamp":"2024-02-15T02:01:19Z"}
{"level":"info","message":"State transition: update-after-store [Download_Leave] -\u003e mender-update-control-refresh-maps [none]","timestamp":"2024-02-15T02:01:19Z"}
{"level":"info","message":"State transition: mender-update-control-refresh-maps [none] -\u003e mender-update-control [none]","timestamp":"2024-02-15T02:01:19Z"}
{"level":"info","message":"State transition: mender-update-control [none] -\u003e update-install [ArtifactInstall]","timestamp":"2024-02-15T02:01:19Z"}
{"level":"info","message":"Executing script: ArtifactInstall_Enter_00_retry_later","timestamp":"2024-02-15T02:01:19Z"}
{"level":"info","message":"Output (stdout) from command \"/var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\": <redacted>-<redacted>-app: ERROR (not running)","timestamp":"2024-02-15T02:01:21Z"}
2nd device log
{"level":"info","message":"Output (stdout) from command \"/var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\": <redacted>-<redacted>-app: signalled","timestamp":"2024-02-15T02:00:12Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nUpdate window reached.\n<redacted> App shutdown signal file NOT found.\nSending SIGUSR1 to <redacted> <redacted> app.\nSleeping 15 seconds before checking for <redacted> <redacted> App status again within the same script execution.\n<redacted> App shutdown signal file found.\nRemoving <redacted> App termianted and termination pending file.\nContinuing update process.\n\n---------- end of script output","timestamp":"2024-02-15T02:00:28Z"}
{"level":"info","message":"State transition: update-install [ArtifactInstall] -\u003e mender-update-control-refresh-maps [none]","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"State transition: mender-update-control-refresh-maps [none] -\u003e mender-update-control [none]","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"State transition: mender-update-control [none] -\u003e update-commit [ArtifactCommit_Enter]","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"Device unauthorized; attempting reauthorization","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"Output (stderr) from command \"/usr/share/mender/identity/mender-device-identity\": using interface /sys/class/net/eth0","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"successfully received new authorization data from server https://eu.hosted.mender.io","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"shutting down websocket connections","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"Local proxy stopped","timestamp":"2024-02-15T02:00:31Z"}
{"level":"warning","message":"error while sending close message: websocket: close sent","timestamp":"2024-02-15T02:00:31Z"}
{"level":"error","message":"Proxy Shutdown failed: context canceled\n","timestamp":"2024-02-15T02:00:31Z"}
{"level":"warning","message":"error while sending close message: websocket: close sent","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"Local proxy started","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"Reauthorization successful","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"State transition: update-commit [ArtifactCommit_Enter] -\u003e update-after-first-commit [none]","timestamp":"2024-02-15T02:00:31Z"}
{"level":"error","message":"Could not write state data to persistent storage: State data stored and retrieved maximum number of times","timestamp":"2024-02-15T02:00:31Z"}
{"level":"error","message":"fatal error: State data stored and retrieved maximum number of times","timestamp":"2024-02-15T02:00:31Z"}
{"level":"error","message":"State transition loop detected in state update-after-first-commit: Forcefully aborting update. The system is likely to be in an inconsistent state after this.","timestamp":"2024-02-15T02:00:31Z"}
{"level":"info","message":"State transition: update-after-first-commit [none] -\u003e update-status-report [none]","timestamp":"2024-02-15T02:00:31Z"}
At the moment of writing this reply, here are the provides
of:
- 1st device:
artifact_name=<redacted>-<redacted>-app-4.1.0_INCONSISTENT
rootfs-image.checksum=15ab7a622a92d125d111b9772d95439a84f4d205a4995b6c0ef9d4c4e3979005
rootfs-image.<redacted>-<redacted>-app.version=4.0.0
rootfs-image.version=2024.01.1
- 2nd device:
artifact_name=<redacted>-<redacted>-app-4.1.0_INCONSISTENT
rootfs-image.checksum=15ab7a622a92d125d111b9772d95439a84f4d205a4995b6c0ef9d4c4e3979005
rootfs-image.<redacted>-<redacted>-app.version=4.1.0
rootfs-image.version=2024.01.1
According to the logs, 1st device failed a deployment, then deployment retry mechanism kicked in (part of Mender Professional), hence the deployment is now pending again as part of the retry.
On the other hand, 2nd device failed a deployment but deployment retry mechanism did not kick in to restart the deployment.
When we first implemented this, we were on a hosted trial plan. As part of a trial plan, we had configure
module included in the client (current rootfs
version). Now, we transitioned from trial plan to Professional without the configure
addon and we are seeing failures. Might that be related?
When we go onto a specific device page within the dashboard, we are constantly presented with a notification that configure
module is not present.
The request ID changes every time the pop up shows.
Please note, that all devices used in the tests started from the same rootfs
image.
Would it be helpful if I share the state scripts we attach o the artifacts and share the ideas of the deployment methodologies we use (boot time vs night time)?
Ps considering our plan, is there a better way to get support, or does forum remain the best approach?