System Software Version after using mender-convert

Hey guys,

First just wanted to thank everyone that contributed to this product!

I was wondering if anyone could help with a question regarding the System Software Version (as visible in the GUI of the server) which is under the rootfs-partition update module (namespace?).

When creating a new golden device dump using mender-convert, how does one include the System Software Version, as well as an Application Software Version, so that it is available in the .img file (to be used to commission new client devices)?

So far I know how to pass this information along from mender-convert to mender-artifact using the available hook, but these versions are then only available in the artifact file (the file which will then be deployed to existing clients).

So for existing clients there is no problem; however, for newly commissioned devices there is nothing to say what the “starting” point is. I am asking because my assumption is that subsequent directory/single-file deployments should have a “depends”, but I fail to see how I can make a new deployment depend on a specific version if the client does not really “have” one.

Thanks in advance

Hi @robertalexa,

Thanks for getting in touch! Concerning software dependencies, there is quite a sophisticated and powerful mechanism available. It is described in the documentation here.

In order to leverage it, you can pass the relevant flags to mender-artifact through mender-convert, and to a substantial extent also modify the Artifacts directly.

Greetz,
Josef

Hi @TheYoctoJester . I appreciate your reply.

Just wanted to mention that we are already doing that and you might have misread my message?

As per above, the correct software versions are passed and presented as part of the artifact generated as a side effect of running mender-convert, and that is great for existing devices in the field.

However, if we commission a new device from the .img generated via the same process, the software version is not preserved (as far as we can tell) and not presented in the GUI. As a result, it is not clear what a newly commissioned device is running until it gets its next update from a future artifact.

Hope that makes sense?

Hey @TheYoctoJester

Don’t mean to be pestering you directly, so accept my apologies if I’ve bothered you!

Let me rephrase my “challenge”, and maybe that helps you help me with our implementation. It may just be a case of me misunderstanding the operational flow (and unfortunately I have found some of the information in the docs to be contradictory).

  1. Is mender-convert meant to be used once - the very first time we create a Mender-ready image - or every time we need to do apt package or full system updates? The information in Convert a Mender Debian image | Mender documentation seems to be slightly at odds with Create an Artifact with system snapshot | Mender documentation

  2. If mender-convert is meant to be used every time a full image needs to be generated, this is where my initial question comes from. How can you pass along the relevant versions? The mender-convert script command does not have any passable arguments (this may be a PR suggestion, so people don’t overwrite the original file from your repo and instead just pass things along natively). In our case, we have added 2 extra ENV variables (ROOTFS_SOFTWARE_VERSION, APPLICATION_SOFTWARE_VERSION), which we then pass along into mender-artifact:

docker run \
  --rm \
  -v "$INPUT_DIRECTORY":/mender-convert/input \
  -v "$LOGS_DIRECTORY":/mender-convert/logs \
  -v "$DEPLOY_DIRECTORY":/mender-convert/deploy \
  --privileged=true \
  --cap-add=SYS_MODULE \
  -v /dev:/dev \
  -v /lib/modules:/lib/modules:ro \
  --env MENDER_ARTIFACT_NAME="${MENDER_ARTIFACT_NAME}" \
  --env MENDER_CONVERT_LOG_FILE=logs/"${LOG_FILE}" \
  --env MENDER_CONVERT_VERSION=${GIT_PROVIDED_TAG_NAME} \
  --env ROOTFS_SOFTWARE_VERSION="${ROOTFS_SOFTWARE_VERSION}" \
  --env APPLICATION_SOFTWARE_VERSION="${APPLICATION_SOFTWARE_VERSION}" \
  "$IMAGE_NAME" "$@"

When doing so, we have also hooked into mender_create_artifact (see Customization | Mender documentation).

Simplified version:

mender_create_artifact() {
  local -r device_type="${1}"
  local -r artifact_name="${2}"

  mender_artifact_name=${device_type}-${artifact_name}.mender
  mender_artifact_path=deploy/${mender_artifact_name}
  log_info "Running custom implementation of the 'mender_create_artifact' hook"

  log_info "Writing Mender artifact to: ${mender_artifact_path}"
  log_info "This can take up to 20 minutes depending on which compression method is used"
  run_and_log_cmd "mender-artifact --compression ${MENDER_ARTIFACT_COMPRESSION} \
      write rootfs-image \
      --file work/rootfs.img \
      --output-path ${mender_artifact_path} \
      --artifact-name ${artifact_name} \
      --device-type ${device_type} \
      --software-version ${ROOTFS_SOFTWARE_VERSION} \
      --provides rootfs-image.application.version:${APPLICATION_SOFTWARE_VERSION} \
      --script input/fd/state_scripts/ArtifactInstall_Enter_00_retain_user_data"
}

As mentioned, the above retains the version information on the newly created artifact, but not on the image that is meant to be used to commission new devices.

  3. If mender-convert is NOT meant to be used every time, and instead mender-artifact should be used (as presented in the above-mentioned article about full system snapshots), then I am even more confused. The article refers to a golden device, but since the golden device does not contain the Mender client, how would that actually work? Also, what about any custom overlays, certain files in certain places, scripts?

  4. The docs say that sudo apt update/upgrade should not be done on a device containing the mender-client, which reinforces my thinking at point 3 that the “golden device” is actually “barebone”

  5. (This is a LE) Looking at the release notes and code of mender-convert 4.0.0 and mender-client 3.5.0, it seems that a bootstrap artifact has been implemented, which I assume to be a “placeholder” artifact that informs the server of certain pieces of info. I suspect that this could be used to pass along the system software version as well as the application version it provides? However, this is not documented at all in the current version of the docs. If this is the real solution, then there is no need for passable arguments; the bootstrap artifact can contain that information, be updated manually for every system deployment, and be version controlled appropriately.

  6. (This is a LE) Are we even meant to use these recent versions, since Mender Server 3.5 was withdrawn? Is the dependency actually from the server downwards? E.g. looking at the JSON https://docs.mender.io/releases/versions.json and matching the version of the server with the corresponding sub-systems?

In simpler terms, I guess my confusion just stems from potentially contradictory information (or at least what I interpret as contradictory). As a side note, what the docs definitely lack is a “real-world use case” type of example/walkthrough.

Things like:

  • start from scratch
  • put your application
  • add your overlays
  • when to convert
  • what to convert
  • what to do when the time comes for an application change
  • what to do when the time comes for an OS upgrade

Any direction would be greatly appreciated here!

Hi @robertalexa

No problem, I don’t feel pestered. I am just traveling and attending an event this week and therefore don’t have the capacity to carefully work through the post. So it is perfectly possible that I did not read carefully enough the first time. Apologies!

Maybe @kacf or @lluiscampos can chime in here?

Greets
Josef


I was just about to politely ask Kristian and Lluís for some help as an off-topic question attached to one of my open PRs, but I appreciate you pinging them.

Offtopic: good luck with your event!


Hi @robertalexa ,

Let me clarify one implementation detail that will shed some light on some of the questions you have.

The System Software Version and the Application Software Version are pieces of information stored in the local database of each device. The database is located in the data partition. It is initialized at first boot from a special “bootstrap Artifact”, when such an Artifact is found; otherwise it is initialized with placeholder values.

This is the reason why this version information travels with the Artifact metadata but not with a mender-convert-generated img: the database is individual per device, and it is not “recreated” or prepared in any form when mender-convert creates a new img.

However, you can utilize the mentioned bootstrap Artifact mechanism to inject such version information (or any other user-defined keys, really) into your newly flashed devices. This is the part of the mender-convert packaging that generates the bootstrap Artifact:

You can write a platform hook that generates your custom bootstrap Artifact and install it in the same location. That should result in what you are asking for: the devices will have the desired versions on first boot.
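A rough sketch of such a hook follows. To be clear, this is an assumption-heavy illustration, not Mender's documented method: the function name, the output path under work/, and the exact `mender-artifact write bootstrap-artifact` flags should all be verified against your mender-convert and mender-artifact versions. The two version variables are the custom ENV variables introduced earlier in this thread.

```shell
#!/bin/bash
# Hypothetical mender-convert platform hook that writes a custom
# bootstrap Artifact carrying the desired version provides.
# The command is composed into a variable and only echoed here; inside
# mender-convert you would execute it with run_and_log_cmd instead.
# The output path is an assumption - check mender-convert-package for
# the location where the stock bootstrap Artifact is installed.

write_custom_bootstrap_artifact() {
    local -r device_type="${1}"
    local -r output="${2:-work/rootfs/data/mender/bootstrap.mender}"

    local -r cmd="mender-artifact write bootstrap-artifact \
        --artifact-name bootstrap \
        --device-type ${device_type} \
        --provides rootfs-image.version:${ROOTFS_SOFTWARE_VERSION} \
        --provides rootfs-image.application.version:${APPLICATION_SOFTWARE_VERSION} \
        --output-path ${output}"

    echo "${cmd}"
    # run_and_log_cmd "${cmd}"  # uncomment when running inside mender-convert
}
```

On first boot the client would then seed its database from this Artifact instead of the placeholder values.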

Otherwise, if you have a suggestion on how we could better support this use case out of the box, I’d be happy to review more code from you :slight_smile:

LluĂ­s


Hi @lluiscampos

Thank you kindly for the reply. That basically touches on my original point 5.

This doesn’t seem to be documented anywhere as of now? (Correct me if I am wrong.) As such, it raises a few more questions in my mind before I can say with 100% certainty that I understand the issue at hand.

  1. If I create a PLATFORM_MODIFY_HOOKS hook that creates a bootstrap artifact, I have the following questions:
  2. If I create a PLATFORM_PACKAGE_HOOKS hook with the idea of overwriting the bootstrap artifact that was natively created:
  3. If I create my own custom file for mender-convert-package, I can control this however I want, but then I will fall out of sync with your codebase

Thanks


Not documented. We considered it an implementation detail and not something that a user might tinker with. I mentioned it because, after all, it is open source and you can exploit it as you wish :slight_smile:

As a side note: the main motivation of this bootstrap Artifact was to support delta Artifacts (Enterprise feature) from first boot, nothing else…

I was thinking about a platform package hook overriding or possibly modifying (*) the bootstrap Artifact. It is packaged into the final image here.

(*) I am not sure we can use mender-artifact modify on a bootstrap Artifact; that I have not done before.

Yes, the hooks are meant to be the entry point for user modifications.

LluĂ­s

PS: this thread might go silent for a while, because starting tomorrow I am going on vacation for a few weeks :grin:


@robertalexa I found that there is some documentation of the feature here.


Hi @lluiscampos hope you had a good time off!

Been struggling to find time (commercially) to continue with our integration, hence the radio silence.

I think the above information, combined with our playground has put us roughly on track.

I do have a question if you would be kind to help:

Our hardware does not have any user inputs (screens, buttons, etc.), but it runs processes that must be gracefully shut down to avoid data loss. So our approach would be to handle updates, if any exist, on a nightly basis at 2 AM.

Config:
UpdateControlMapExpirationTimeSeconds - 48 hours (long time to avoid expiration)
UpdateControlMapBootExpirationTimeSeconds - 600 (default)
UpdateControlMapPollIntervalSeconds - 60 seconds

State script
ArtifactInstall_Leave
- dbus update control map - uuid FIXED_VALUE_UUID, priority 0, ArtifactCommit_Enter - pause
- SIGNAL to application that an update is due
ArtifactCommit_Leave (on just directory updates)
- supervisor restart main_app.py

Application main thread
When SIGNAL is received, start graceful app halting.
When ready, send dbus update control map uuid FIXED_VALUE_UUID, priority 0, ArtifactCommit_Enter - continue

My hope is that by using the same UUID and the same priority, the update control map will be updated in place. I am loosely basing my assumption on the code here https://github.com/mendersoftware/mender/blob/e7ac5bc79c078c6af69ac36955677f886a734c5d/app/updatemanager.go#L141

When another deployment happens, the cycle repeats.

Would you be kind to sanity check this approach? Would really appreciate some clarity around this.

PS: For the sake of clarity, our application is a Python app kept alive using supervisor. The choice of letting ArtifactInstall happen naturally was deliberate, as this makes the “deployment” time faster (from starting the graceful stop to resuming) for both A/B and directory updates. Given that the Python app is run via supervisor, its “old” version will continue to run in memory. Supervisor will restart the task on ArtifactCommit_Leave.

Regards
Rob

Hello @robertalexa. I had a lovely vacation, thanks!

Based on your description, it seems like the use of update control maps is not because a (human) user needs to interact but basically that you need some good time for a graceful (non-human) shutdown of a given application.

I suggest then dropping the use of update control maps altogether and just relying on the “retry later” mechanism of state scripts. See here; it is designed exactly for this use case.

Then the high-level idea is to have one single state script (you decide in which state of the state machine it runs). The script checks if the application is running; if so, it signals the application to shut down and returns “retry later” to Mender. On successive calls the application will at some point be done, and then the script can just return “ok” to Mender.
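For reference, a minimal sketch of that check, assuming a hypothetical pidfile as the "app still running" marker (the signalling mechanism itself is application-specific). Exit code 21 is the documented state-script "retry later" code; the logic is wrapped in a function so a real script would end with `check_app_stopped /run/main_app.pid; exit $?`.

```shell
#!/bin/bash
# Sketch of the "retry later" logic for a state script such as
# ArtifactInstall_Enter_00_wait_for_app. The pidfile path is a
# hypothetical marker for "application still running".

check_app_stopped() {
    local -r pidfile="${1:-/run/main_app.pid}"

    if [ -e "${pidfile}" ]; then
        # App still running: signal it to shut down gracefully
        # (application-specific), then ask Mender to re-run this
        # script after StateScriptRetryIntervalSeconds.
        echo "Application still running, retrying later." >&2
        return 21
    fi

    echo "Application stopped, continuing update." >&2
    return 0
}
```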

Does this make sense? Why do you think that update control maps is a better fit?

LluĂ­s

Hey guys,

Thanks for sharing your knowledge and providing hints on the best direction to go in terms of Mender integration. @robertalexa is busy with another project at the moment, so I would like to step in and resolve some design concerns we have encountered in the meantime.

@lluiscampos thanks for the proposed approach. To reiterate: a state script will keep returning 21 (“retry later”) until all the requirements are satisfied. Then, when all the requirements are satisfied, the state script returns 0, which allows the Mender update to continue.

The “retry later” approach seems much more straightforward than update control map manipulation. Seems like we overlooked such a neat approach to the issue :sweat_smile:.

However, there is a specific scenario that might cause the system to misbehave - a sudden power loss while a state script is in the “retry later” state. Let me provide an example for clarification:

  • Update is received - Download state completed successfully, deployment enters ArtifactInstall_Enter state
  • During ArtifactInstall_Enter a state script is run to check whether the requirements are satisfied - not all of them are yet, so it returns “retry later”
  • State scripts is scheduled to be rerun in X minutes
  • Device loses power (i.e. customer disconnects the power plug)
  • Device gets connected back to a power outlet - as per the documentation, the Artifact enters an error state, which results in failure of the entire deployment

Ideally, after power loss in such a scenario, it is desired for the update to still be applied under the same deployment.

Is it achievable to apply the deployment that was delayed via “retry later” state script after a power loss?

Hello @dmitrijsan,

This time it was me who went into radio silence for a while :sweat_smile:

Yes, that is a general safety feature of the Mender client: a power loss in any state other than the reboot state will be considered a deployment failure. You would normally then trigger the same deployment again from the Mender server. Note that this would be the same with Update Control Maps.

However, it can in turn be handled from the Mender server with more advanced features, like setting a number of retries for the deployment or defining a deployment for a dynamic group. Try these out!

Thanks a lot @TheYoctoJester and @lluiscampos. With combined knowledge and dedication from both ends, we managed to design a flexible and reliable way of delivering OTA updates.

Therefore, I would like to summarise what has been discussed so far, as we have managed to reach a conclusion and lay down a path that is suitable for our use case.

  • Whenever using mender-convert to convert a plain RPi OS image with a custom app included, it is possible to generate a rootfs-deployable artifact (*.mender file) as well as a flashable image file (*.img file)
    • The *.mender file can be used to deploy rootfs updates to the entire fleet
    • The *.img file can be used to commission new client devices
  • When using mender-convert, it is possible to pass in all required artifact provides. This can be used to version both:
    • the rootfs
    • the custom application
  • When using mender-convert, it is also possible to add a bootstrap artifact to the flashable *.img file. The bootstrap artifact gets applied automatically on the first boot of a device; afterwards, the client notifies the Mender Server about its data/versions. The bootstrap artifact was designed for delta updates; however, it can be utilised in other use cases too.
  • Thanks to the initial data provided via the bootstrap artifact, it is now possible to create a dependency chain of rootfs and custom application (e.g. rootfs A is released, providing app A; then app B is released, depending on rootfs A; then rootfs B is released, providing app C, and so on)
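As an illustration of that dependency chain, writing the “app B depends on rootfs A” Artifact could look roughly like this. The command is only composed and echoed here; the artifact names, versions, and device type are purely illustrative, and the exact provides/depends keys should be matched to your own setup.

```shell
#!/bin/bash
# Sketch: a directory-update Artifact for "app B" that depends on
# "rootfs A" via the provides/depends mechanism. All values are
# illustrative; in a real build they would come from your pipeline.

cmd="mender-artifact write module-image \
    --type directory \
    --artifact-name app-B \
    --device-type raspberrypi3 \
    --depends rootfs-image.version:rootfs-A \
    --provides rootfs-image.application.version:app-B \
    --file update.tar \
    --output-path app-B.mender"

echo "${cmd}"
```

With the bootstrap artifact seeding rootfs-image.version on first boot, the server can match this depends clause even for devices that have never received an update.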

In terms of tackling a graceful custom app shutdown before proceeding with an update: whenever a state script returns 21, it will be retried later. This functionality can be used to postpone an update:

  • A state script at Download_Leave (state scripts must reside on the device) sends a termination signal to the application
  • A state script at ArtifactInstall_Enter checks whether the custom app has terminated:
    • If the custom app has already terminated, it returns 0 and the update process continues
    • If the custom app is still running, it returns 21 and the update process is retried later

Note: make sure to adjust the mender.conf variables StateScriptRetryIntervalSeconds and StateScriptRetryTimeoutSeconds to suit the use case.
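For example, with a 60-second retry interval and an update window that may be almost a day and a half away, the relevant mender.conf fragment (values illustrative; they match the ones quoted later in this thread) could be:

```json
{
    "StateScriptRetryIntervalSeconds": 60,
    "StateScriptRetryTimeoutSeconds": 129600
}
```

The timeout must comfortably exceed the longest possible wait, otherwise the retrying script itself will fail the deployment.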

In some use cases, the time frame between sending the signal to the custom app and the app actually terminating can be very long (i.e. more than 4 hours), and all kinds of issues might happen in between, e.g. loss of power. Hence, to recover from such a scenario, the best option is the Deployment retry feature that is available starting with the Mender Professional plan. As a result, if power loss happens during any deployment step, the deployment will be retried up to the number of retry attempts specified within the deployment. It can be set as high as 100, which leaves plenty of room for the deployment to succeed. If a deployment fails 100 times, it is reasonable to assume that something went catastrophically wrong with the deployment itself or that the system was misused.

We tried to stick with the Community version of Mender for as long as possible; however, implementing deployment retry functionality ourselves would have taken a substantial amount of effort. Moreover, I strongly believe that the exposed APIs are not enough for a reliable deployment retry implementation. Therefore, deployment retry was the breaking point where we decided to move from the Community version to Professional.

I really wanted to post this knowledge so that people reading this thread in the future can take inspiration from it. If all this was useful to us and helped us implement a solid design for OTA updates, it might be helpful for others too.


Hi @dmitrijsan,

Thanks a lot for sharing. It is much appreciated for others to follow!
And welcome to Mender Professional! :slight_smile:

Greets,
Josef


Hello again,

We have implemented the structure of the application as discussed above. A trial deployment was released, but it ended up in an INCONSISTENT state. Below are some details of the deployment:

  • Update module used: directory
  • Mender client version: 3.5.2
  • Mender server version: hosted (Professional), latest
Device deployment logs
2024-02-12 15:56:33 +0000 UTC info: Running Mender client version: 3.5.2
2024-02-12 15:56:34 +0000 UTC info: State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]
2024-02-12 15:56:34 +0000 UTC info: Installer: authenticated digital signature of artifact
2024-02-12 15:56:38 +0000 UTC info: State transition: update-store [Download_Enter] -> update-after-store [Download_Leave]
2024-02-12 15:56:38 +0000 UTC info: State transition: update-after-store [Download_Leave] -> mender-update-control-refresh-maps [none]
2024-02-12 15:56:38 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control [none]
2024-02-12 15:56:38 +0000 UTC info: State transition: mender-update-control [none] -> update-install [ArtifactInstall]
2024-02-12 15:56:38 +0000 UTC info: Executing script: ArtifactInstall_Enter_00_retry_later
2024-02-12 15:56:39 +0000 UTC info: Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later
rootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later
Outside of update window.
Postponing to retry update install process later.

---------- end of script output
2024-02-12 15:56:39 +0000 UTC info: statescript: ArtifactInstall_Enter_00_retry_later requested a retry
2024-02-12 15:57:40 +0000 UTC info: Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later
rootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later
Outside of update window.
Postponing to retry update install process later.

---------- end of script output
2024-02-12 15:57:40 +0000 UTC info: statescript: ArtifactInstall_Enter_00_retry_later requested a retry
2024-02-12 15:58:41 +0000 UTC info: Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later
rootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later
Outside of update window.
Postponing to retry update install process later.

---------- end of script output
2024-02-12 15:58:41 +0000 UTC info: statescript: ArtifactInstall_Enter_00_retry_later requested a retry
2024-02-12 15:59:43 +0000 UTC info: Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later
rootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later
Outside of update window.
Postponing to retry update install process later.

---------- end of script output
.
.
. 
.
<It was retried for approx. 500 more times. So, truncated.>
.
.
.
.
2024-02-13 01:57:04 +0000 UTC info: statescript: ArtifactInstall_Enter_00_retry_later requested a retry
2024-02-13 01:58:05 +0000 UTC info: Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later
rootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later
Outside of update window.
Postponing to retry update install process later.

---------- end of script output
2024-02-13 01:58:05 +0000 UTC info: statescript: ArtifactInstall_Enter_00_retry_later requested a retry
2024-02-13 01:59:06 +0000 UTC info: Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later
rootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later
Outside of update window.
Postponing to retry update install process later.

---------- end of script output
2024-02-13 01:59:06 +0000 UTC info: statescript: ArtifactInstall_Enter_00_retry_later requested a retry
2024-02-13 02:00:23 +0000 UTC info: Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later
rootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later
Update window reached.
Continuing update process.

---------- end of script output
2024-02-13 02:00:27 +0000 UTC info: State transition: update-install [ArtifactInstall] -> mender-update-control-refresh-maps [none]
2024-02-13 02:00:27 +0000 UTC info: State transition: mender-update-control-refresh-maps [none] -> mender-update-control [none]
2024-02-13 02:00:27 +0000 UTC info: State transition: mender-update-control [none] -> update-commit [ArtifactCommit_Enter]
2024-02-13 02:00:27 +0000 UTC info: Device unauthorized; attempting reauthorization
2024-02-13 02:00:27 +0000 UTC info: Output (stderr) from command "/usr/share/mender/identity/mender-device-identity": using interface /sys/class/net/eth0
2024-02-13 02:00:27 +0000 UTC info: successfully received new authorization data from server https://eu.hosted.mender.io
2024-02-13 02:00:27 +0000 UTC info: shutting down websocket connections
2024-02-13 02:00:27 +0000 UTC info: Local proxy stopped
2024-02-13 02:00:27 +0000 UTC warning: error while sending close message: websocket: close sent
2024-02-13 02:00:27 +0000 UTC error: Proxy Shutdown failed: context canceled
2024-02-13 02:00:27 +0000 UTC warning: error while sending close message: websocket: close sent
2024-02-13 02:00:27 +0000 UTC info: Local proxy started
2024-02-13 02:00:27 +0000 UTC info: Reauthorization successful
2024-02-13 02:00:27 +0000 UTC info: State transition: update-commit [ArtifactCommit_Enter] -> update-after-first-commit [none]
2024-02-13 02:00:27 +0000 UTC error: Could not write state data to persistent storage: State data stored and retrieved maximum number of times
2024-02-13 02:00:27 +0000 UTC error: fatal error: State data stored and retrieved maximum number of times
2024-02-13 02:00:27 +0000 UTC error: State transition loop detected in state update-after-first-commit: Forcefully aborting update. The system is likely to be in an inconsistent state after this.
2024-02-13 02:00:27 +0000 UTC info: State transition: update-after-first-commit [none] -> update-status-report [none]
Artifact details
Mender artifact:
  Name: app-4.1.0
  Format: mender
  Version: 3
  Signature: signed and verified correctly
  Compatible devices: '[raspberrypi3]'
  Provides group: 
  Depends on one of artifact(s): []
  Depends on one of group(s): []
  State scripts:
    ArtifactCommit_Leave_00_start_app
    ArtifactFailure_Leave_00_start_app
    ArtifactInstall_Enter_00_retry_later

Updates:
    0:
    Type:   directory
    Provides:
	rootfs-image.app.version: 4.1.0
    Depends:
	rootfs-image.version: 2024.01.1
    Clears Provides: ["rootfs-image.app.*"]
    Metadata: Nothing
    Files:
      name:     update.tar
      size:     26685440
      modified: 2024-02-12 15:48:43 +0000 GMT
      checksum: <redacted>
      name:     dest_dir
      size:     17
      modified: 2024-02-12 15:48:43 +0000 GMT
      checksum: <redacted>
mender.conf settings
{
    "ArtifactVerifyKey": "<redacted>",
    "Connectivity": {
    "DisableKeepAlive": false,
    "IdleConnTimeoutSeconds": 30
    },
    "RetryPollIntervalSeconds": 300,
    "InventoryPollIntervalSeconds": 28800,
    "UpdatePollIntervalSeconds": 1800,
    "ModuleTimeoutSeconds": 14400,
    "ServerURL": "<redacted>",
    "TenantToken": "<redacted>",
    "UpdateLogPath": "/data/logs/mender/updates.log",
    "DeviceTypeFile": "/data/mender/device_type",
    "StateScriptRetryIntervalSeconds": 60,
    "StateScriptRetryTimeoutSeconds": 129600
}

I am unsure of the reason the device ended up in the inconsistent state. I can clearly see that the ArtifactInstall step succeeded (new files were placed on the device successfully; I confirmed that manually), but it failed at ArtifactCommit_Enter. As a result, no state scripts were run at ArtifactCommit_Leave or ArtifactFailure_Leave, leaving the device with a non-running application.

I have researched the issue a bit and came across the following post: Artifact installs failing on “Tried maximum amount of times”

In particular, those two comments:

Does that mean I must increase RetryPollCount? Currently, as per the config above, it is not specified, so it defaults to 3 * ceil(log2(RetryPollIntervalSeconds) + 1). As per the mender.conf above, RetryPollIntervalSeconds is 300, meaning RetryPollCount = 3 * ceil(log2(300) + 1) = 30. However, there were approx. 550 “retry later” requests from the script (so approx. 550 minutes / 9 hours of “delay”). Could this be the reason?
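Taking the quoted default formula at face value, the arithmetic can be checked with a quick shell snippet (awk's log() is the natural log, so divide by log(2) to get log2):

```shell
# Evaluate 3 * ceil(log2(300) + 1), the default-RetryPollCount
# formula quoted above, for RetryPollIntervalSeconds = 300.
retry_poll_interval=300
retry_poll_count=$(awk -v i="${retry_poll_interval}" '
    function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
    BEGIN { print 3 * ceil(log(i) / log(2) + 1) }')
echo "${retry_poll_count}"   # 30
```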

If not, what could be a potential cause of a failure in delivering an update as per scenario above?

Looking forward to hearing from you soon.

LE: Could it be ModuleTimeoutSeconds?

Hey @dmitrijsan, I don’t think this is related to RetryPollCount at all. The scenario “State data stored and retrieved maximum number of times” should never be reached unless either mender itself or the update module has a bug and is looping during reboot. But I don’t think that’s the case here, so this is a bug somewhere. I can’t immediately see where, though.

I think the 550 requests to retry should not be a problem either. As far as I can see from the code, they do not increase the counter which eventually leads to the error message, so the problem must be somewhere else.

Is this repeatable on the next deployment?


@kacf, thanks for a prompt response

Let me elaborate on the steps we tried to remedy the issue.

First, we tried to re-deploy the same artifact with the directory module - it was rejected as already installed. I am unsure why it was considered already installed, as the current device state was 4.1.0-INCONSISTENT while 4.1.0 was expected.

Then we tried to re-deploy the same directory-module artifact with “Force update” ticked - this successfully re-applied the artifact.

Then, to be 100% safe, we deployed the current A/B artifact (again with “Force update” ticked, as it was technically installing the same thing the device had before the update that brought it into the INCONSISTENT state) to trigger a switch to the other partition. This successfully reverted the device back to a fully operational state, but with the app outdated (4.0.0 instead of the latest 4.1.0).

Then I created 2 new individual device deployments (one for the “rescued” device, another for a newly commissioned device, just to see if it was a fluke) with the same release via the directory module - both completed successfully.

The only differences between the failing case and the successful ones:

  • Now there were 2 individual device deployments created. Both succeeded.
  • Before, it was a deployment created for a device group containing a single device. That deployment failed.
  • The device that failed to apply/install the deployment had been up for ~3-4 weeks before receiving it. After a reboot, all deployments for that device have succeeded so far.

Please find below the logs from the 2 deployments that succeeded:

Device deployment log 1
{"level":"info","message":"Running Mender client version: 3.5.2","timestamp":"2024-02-13T16:17:45Z"}
{"level":"info","message":"State transition: update-fetch [Download_Enter] -\u003e update-store [Download_Enter]","timestamp":"2024-02-13T16:17:45Z"}
{"level":"info","message":"Installer: authenticated digital signature of artifact","timestamp":"2024-02-13T16:17:46Z"}
{"level":"info","message":"State transition: update-store [Download_Enter] -\u003e update-after-store [Download_Leave]","timestamp":"2024-02-13T16:17:49Z"}
{"level":"info","message":"State transition: update-after-store [Download_Leave] -\u003e mender-update-control-refresh-maps [none]","timestamp":"2024-02-13T16:17:49Z"}
{"level":"info","message":"State transition: mender-update-control-refresh-maps [none] -\u003e mender-update-control [none]","timestamp":"2024-02-13T16:17:49Z"}
{"level":"info","message":"State transition: mender-update-control [none] -\u003e update-install [ArtifactInstall]","timestamp":"2024-02-13T16:17:49Z"}
{"level":"info","message":"Executing script: ArtifactInstall_Enter_00_retry_later","timestamp":"2024-02-13T16:17:49Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-13T16:17:50Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-13T16:17:50Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-13T16:18:51Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-13T16:18:51Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-13T16:19:52Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-13T16:19:52Z"}
<Approx. 500 retry later requests truncated>
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-14T01:58:55Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-14T01:58:55Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-14T01:59:56Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-14T01:59:56Z"}
{"level":"info","message":"Output (stdout) from command \"/var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\": <redacted>-<redacted>-app: signalled","timestamp":"2024-02-14T02:00:58Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nUpdate window reached.\n<redacted> App shutdown signal file NOT found.\nSending SIGUSR1 to <redacted> <redacted> app.\nSleeping 15 seconds before checking for <redacted> <redacted> App status again within the same script execution.\n<redacted> App shutdown signal file found.\nRemoving <redacted> App termianted and termination pending file.\nContinuing update process.\n\n---------- end of script output","timestamp":"2024-02-14T02:01:13Z"}
{"level":"info","message":"State transition: update-install [ArtifactInstall] -\u003e mender-update-control-refresh-maps [none]","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"State transition: mender-update-control-refresh-maps [none] -\u003e mender-update-control [none]","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"State transition: mender-update-control [none] -\u003e update-commit [ArtifactCommit_Enter]","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"Device unauthorized; attempting reauthorization","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"Output (stderr) from command \"/usr/share/mender/identity/mender-device-identity\": using interface /sys/class/net/eth0","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"successfully received new authorization data from server https://eu.hosted.mender.io","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"shutting down websocket connections","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"Local proxy stopped","timestamp":"2024-02-14T02:01:16Z"}
{"level":"warning","message":"error while sending close message: websocket: close sent","timestamp":"2024-02-14T02:01:16Z"}
{"level":"error","message":"Proxy Shutdown failed: context canceled\n","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"Local proxy started","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"Reauthorization successful","timestamp":"2024-02-14T02:01:16Z"}
{"level":"warning","message":"error while sending close message: websocket: close sent","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"State transition: update-commit [ArtifactCommit_Enter] -\u003e update-after-first-commit [none]","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"State transition: update-after-first-commit [none] -\u003e update-after-commit [ArtifactCommit_Leave]","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"State transition: update-after-commit [ArtifactCommit_Leave] -\u003e cleanup [none]","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"Executing script: ArtifactCommit_Leave_00_start_<redacted>_<redacted>_app","timestamp":"2024-02-14T02:01:16Z"}
{"level":"info","message":"Output (stdout) from command \"/var/lib/mender/scripts/ArtifactCommit_Leave_00_start_<redacted>_<redacted>_app\": <redacted>-<redacted>-app: started","timestamp":"2024-02-14T02:01:19Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactCommit_Leave_00_start_<redacted>_<redacted>_app\n<redacted>-<redacted>-app-4.1.0: Running ArtifactCommit_Leave_00_start_<redacted>_<redacted>_app\n\n---------- end of script output","timestamp":"2024-02-14T02:01:19Z"}
{"level":"info","message":"State transition: cleanup [none] -\u003e update-status-report [none]","timestamp":"2024-02-14T02:01:19Z"}
Device deployment log 2
{"level":"info","message":"Running Mender client version: 3.5.2","timestamp":"2024-02-13T16:12:53Z"}
{"level":"info","message":"State transition: update-fetch [Download_Enter] -\u003e update-store [Download_Enter]","timestamp":"2024-02-13T16:12:53Z"}
{"level":"info","message":"Installer: authenticated digital signature of artifact","timestamp":"2024-02-13T16:12:53Z"}
{"level":"info","message":"State transition: update-store [Download_Enter] -\u003e update-after-store [Download_Leave]","timestamp":"2024-02-13T16:12:57Z"}
{"level":"info","message":"State transition: update-after-store [Download_Leave] -\u003e mender-update-control-refresh-maps [none]","timestamp":"2024-02-13T16:12:57Z"}
{"level":"info","message":"State transition: mender-update-control-refresh-maps [none] -\u003e mender-update-control [none]","timestamp":"2024-02-13T16:12:57Z"}
{"level":"info","message":"State transition: mender-update-control [none] -\u003e update-install [ArtifactInstall]","timestamp":"2024-02-13T16:12:57Z"}
{"level":"info","message":"Executing script: ArtifactInstall_Enter_00_retry_later","timestamp":"2024-02-13T16:12:57Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-13T16:12:58Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-13T16:12:58Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-13T16:13:59Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-13T16:13:59Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-13T16:15:00Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-13T16:15:00Z"}
<Approx. 500 retry later requests truncated>
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-14T01:58:09Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-14T01:58:09Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nOutside of update window.\n<redacted> App shutdown signal file NOT found.\nPostponing to retry update install process later.\n\n---------- end of script output","timestamp":"2024-02-14T01:59:10Z"}
{"level":"info","message":"statescript: ArtifactInstall_Enter_00_retry_later requested a retry","timestamp":"2024-02-14T01:59:10Z"}
{"level":"info","message":"Output (stdout) from command \"/var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\": <redacted>-<redacted>-app: signalled","timestamp":"2024-02-14T02:00:12Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactInstall_Enter_00_retry_later\nrootfs-2024.01.1: Running ArtifactInstall_Enter_00_retry_later\nUpdate window reached.\n<redacted> App shutdown signal file NOT found.\nSending SIGUSR1 to <redacted> <redacted> app.\nSleeping 15 seconds before checking for <redacted> <redacted> App status again within the same script execution.\n<redacted> App shutdown signal file found.\nRemoving <redacted> App termianted and termination pending file.\nContinuing update process.\n\n---------- end of script output","timestamp":"2024-02-14T02:00:27Z"}
{"level":"info","message":"State transition: update-install [ArtifactInstall] -\u003e mender-update-control-refresh-maps [none]","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"State transition: mender-update-control-refresh-maps [none] -\u003e mender-update-control [none]","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"State transition: mender-update-control [none] -\u003e update-commit [ArtifactCommit_Enter]","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"Device unauthorized; attempting reauthorization","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"Output (stderr) from command \"/usr/share/mender/identity/mender-device-identity\": using interface /sys/class/net/eth0","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"successfully received new authorization data from server https://eu.hosted.mender.io","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"shutting down websocket connections","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"Local proxy stopped","timestamp":"2024-02-14T02:00:30Z"}
{"level":"warning","message":"error while sending close message: websocket: close sent","timestamp":"2024-02-14T02:00:30Z"}
{"level":"error","message":"Proxy Shutdown failed: context canceled\n","timestamp":"2024-02-14T02:00:30Z"}
{"level":"warning","message":"error while sending close message: websocket: close sent","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"Local proxy started","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"Reauthorization successful","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"State transition: update-commit [ArtifactCommit_Enter] -\u003e update-after-first-commit [none]","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"State transition: update-after-first-commit [none] -\u003e update-after-commit [ArtifactCommit_Leave]","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"State transition: update-after-commit [ArtifactCommit_Leave] -\u003e cleanup [none]","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"Executing script: ArtifactCommit_Leave_00_start_<redacted>_<redacted>_app","timestamp":"2024-02-14T02:00:30Z"}
{"level":"info","message":"Output (stdout) from command \"/var/lib/mender/scripts/ArtifactCommit_Leave_00_start_<redacted>_<redacted>_app\": <redacted>-<redacted>-app: started","timestamp":"2024-02-14T02:00:33Z"}
{"level":"info","message":"Collected output (stderr) while running script /var/lib/mender/scripts/ArtifactCommit_Leave_00_start_<redacted>_<redacted>_app\n<redacted>-<redacted>-app-4.1.0: Running ArtifactCommit_Leave_00_start_<redacted>_<redacted>_app\n\n---------- end of script output","timestamp":"2024-02-14T02:00:33Z"}
{"level":"info","message":"State transition: cleanup [none] -\u003e update-status-report [none]","timestamp":"2024-02-14T02:00:33Z"}

It is concerning that the deployment failed for no apparent reason on our very first deployment to our very first device, before sending these devices into the field.

Thanks for testing @dmitrijsan.

I don’t think the different deployment approaches make any difference. From the client’s perspective they all look the same: either there is an artifact to install or there isn’t; the client simply follows the server’s response.

That leaves me with two theories, neither of them very good, but it is what it is:

  1. The deployments in the last two cases were made at a slightly later time than the original one. If there is indeed some kind of problem with the state counter, it’s possible that it had not counted as far as it did the first time, and therefore didn’t hit the problem. But again, based on what I’m reading in the code, it should not increase in this scenario at all, not to mention that 500 is way higher than it should ever accept, so this is still a pretty weak theory.

  2. Second, and I know this is a bit of a cop-out, but still: if it has only failed once, it could simply be a minor corruption issue, either in storage or in memory, which made this trigger once because the counter was out of range.

Yeah, I told you both theories were not very good! :slightly_smiling_face:

But in any case, keep us posted if it happens again! Maybe something will reveal itself then!