Handling random reboots

Description

Hi, we are using a Mender update module to update some of the binaries on our systems. Basically, Mender downloads the artifact and notifies our agent over DBus that an update is available. The update itself is handled by our custom update agent, which puts the system into recovery mode, updates the binary files and reboots. The update module answers NeedsArtifactReboot with "Yes" and, after the reboot, verifies the content of a file to check whether the update was successful.

In the normal, positive flow everything works, but sometimes, when the device is rebooted after the Download state, the ArtifactVerifyReboot state fails because the verification check does not pass. Is there a way to distinguish reboots caused by the actual update from random reboots done by the user?

Let me just see if I have understood the question correctly: Mender finds an update, the update module sends the info to the custom updater, then Mender proceeds to ArtifactReboot, where it reboots and enters recovery mode. In this mode, your custom update agent is working, and while it is working, the user pulls the plug and reboots the device?

If the ArtifactVerifyReboot section is detecting this as an invalid install, isn’t this more or less the same as detecting a random reboot? It should never happen if the update is installed correctly, so it must be because the device rebooted before it was supposed to.

OK, here's the update module for reference. As can be seen, in the ArtifactReboot state Mender makes a blocking call to the custom update agent. The agent shows a notification on the UI, waits for the user to accept the update, and then proceeds with the update and the reboot. Note that Mender itself does nothing in this state except make the blocking call to the executable. When the system reboots, ArtifactVerifyReboot is used to verify the update status by comparing the artifact_info files.

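#!/system/bin/sh
# Mender update module: hands the payload to the custom update agent, which
# installs it and reboots; the result is checked after the reboot.
set -e

STATE="$1"
FILES="$2"
DEST="/var/downloads"
SYS_ARTIFACT_FILE="/system/bin/artifact_info"

case "$STATE" in
    NeedsArtifactReboot)
        # Tell Mender that a reboot is required after installation.
        echo "Yes"
        ;;
    ArtifactReboot)
        echo "Files: $FILES, Dest: $DEST"
        mkdir -p "$DEST"
        cp "$FILES"/files/*.zip "$DEST/ota.zip"
        cp "$FILES/files/artifact_info" "$DEST/artifact_info"
        # Name of the .zip payload shipped in the artifact (assumes exactly one .zip).
        ARTIFACT=$(basename "$FILES"/files/*.zip)
        # artifact_info holds a key=value line; keep everything after the first '='.
        VERSION=$(cut -d '=' -f 2- "$DEST/artifact_info")

        echo "ARTIFACT: $ARTIFACT, VERSION: $VERSION"
        # Blocking call: the agent asks the user for consent, installs and reboots.
        # Pass the path of the copy made above.
        /system/bin/call-updater -artifactFile "$DEST/ota.zip" -artifactVersion "$VERSION"
        ;;
    ArtifactVerifyReboot)
        # Succeed only if the installed artifact_info matches the delivered one.
        cmp -s "$DEST/artifact_info" "$SYS_ARTIFACT_FILE" || exit 1
        ;;
esac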

Now, if a reboot that is not triggered by the custom agent happens while Mender is in the ArtifactReboot state (say the user simply reboots the system), the ArtifactVerifyReboot state fails and the deployment is marked as failed. What I want is a way to ignore (?) every reboot except the one the agent performs after the user consents, so that after any other reboot Mender resumes at the blocking call.
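Telling the two kinds of reboot apart is possible with a marker file, even though it does not stop the deployment from failing. Below is a minimal sketch, assuming the custom agent is changed to create a marker file just before it reboots, on storage that survives both the reboot and recovery mode. The marker path `$DEST/agent_reboot` and the function name `verify_after_reboot` are hypothetical and not part of the original setup.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: the custom update agent (not Mender) is changed to
# create the hypothetical marker file "$DEST/agent_reboot" just before it
# reboots, on storage that survives the reboot and recovery mode.

# Usage: verify_after_reboot DEST SYS_ARTIFACT_FILE
# Returns 0 if the update is verified, 1 otherwise; the reason goes to stderr.
verify_after_reboot() {
    dest=$1
    sys_info=$2
    marker="$dest/agent_reboot"

    if [ ! -f "$marker" ]; then
        # No marker: the agent never reached its own reboot, so this was a
        # user-initiated reboot or a power loss during ArtifactReboot.
        echo "verify: reboot was not initiated by the update agent" >&2
        return 1
    fi
    # Consume the marker so that a later, unrelated reboot is not misread.
    rm -f "$marker"

    if cmp -s "$dest/artifact_info" "$sys_info"; then
        return 0
    fi
    echo "verify: agent rebooted, but the installed artifact_info differs" >&2
    return 1
}
```

In the module, the ArtifactVerifyReboot branch would then call `verify_after_reboot "$DEST" "$SYS_ARTIFACT_FILE" || exit 1`. Mender still treats the non-zero exit as a failed deployment; the sketch only makes the cause explicit in the module's output.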

Here are the possible solutions I can think of:

  • Use exit code 21 in ArtifactVerifyReboot if the comparison fails. This would make Mender retry the state, but the notification to the custom agent would still be missing (maybe I should call the updater there as well?)
  • Move the call to the custom agent to the Download state and handle validation in later states (?)

Any other ideas, or even a hint at a better way of doing this, would be much appreciated.

Sanjay

Now I understand the problem. Unfortunately, I don't think what you're asking for is possible. As can be seen in the state diagram for power loss, a spontaneous reboot always leads to an error state, except during the very final ArtifactCommit state. This is simply the most straightforward way of handling spontaneous reboots: assume they mean failure.

It's something we would like to improve, and we have been discussing some solutions. However, it's still at an early stage, and I can't give you any time estimate for when this would be implemented.

A possible solution for you is to use the server API to automatically schedule a new deployment when an old one has failed due to a spontaneous reboot. The device will have to re-download the image, though.
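A minimal sketch of that approach with curl, assuming the create-deployment endpoint of the Mender deployments management API v1 (`POST /api/management/v1/deployments/deployments` with `name`, `artifact_name` and `devices` in the JSON body) and a bearer token. `MENDER_SERVER`, `MENDER_TOKEN` and both function names are placeholders. Collecting the IDs of the failed devices is left out, and the endpoint and fields should be checked against the Mender API reference before relying on this.

```shell
#!/bin/sh
# Sketch: re-create a deployment for devices whose previous deployment failed.
# ASSUMPTIONS: Mender deployments management API v1; MENDER_SERVER and
# MENDER_TOKEN are placeholder variables; the failed device IDs are already
# known to the caller.

# Print the JSON body for a new deployment.
# $1 = deployment name, $2 = artifact name, remaining args = device IDs.
# No JSON escaping is done: keep names and IDs free of quotes and backslashes.
build_deployment_body() {
    name=$1
    artifact=$2
    shift 2
    devices=""
    for id in "$@"; do
        # Append a comma only when the list already has an entry.
        devices="${devices:+$devices,}\"$id\""
    done
    printf '{"name":"%s","artifact_name":"%s","devices":[%s]}\n' \
        "$name" "$artifact" "$devices"
}

# POST the body to the server; --fail makes curl exit non-zero on HTTP errors.
create_deployment() {
    build_deployment_body "$@" | curl --fail --silent --show-error \
        -X POST \
        -H "Authorization: Bearer $MENDER_TOKEN" \
        -H "Content-Type: application/json" \
        --data @- \
        "$MENDER_SERVER/api/management/v1/deployments/deployments"
}
```

Example call: `create_deployment retry-1 my-release-2 <device-id-1> <device-id-2>`. Per the last reply in this thread, Mender Professional offers automatic re-deployment, which would make a script like this unnecessary on that plan.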

Thanks @kacf for the response. Is automatic re-deployment available as part of Mender Professional?

Here is the feature matrix for all Mender plans.
TL;DR: Yes 🙂