Deployment management for failures and new devices

Hi everyone,

We recently started using Mender 1.7 to deploy updates to a fleet of devices. The update process is going well, but we are facing two deployment-related issues and were wondering whether there are “good practices” for handling them.

The first issue is how to handle preauthorized devices coming online for the first time. We released an important update and communicated about it, and now some users are unpacking their device and expect the update to start right away. However, if we created the deployment before they started using the device, they won’t receive it until the next deployment, and in the meantime they are likely to contact customer support to ask why they cannot update.
We were thinking about periodically querying the server for newly added devices and creating a deployment for them, but this seems quite impractical and not really the “good” way to use deployments. Are there any other solutions?

The second issue is how to handle a device that encountered an error during an update (for instance, if the user restarted the device while it was downloading the artifact). The problem is similar to the previous one: after a failed deployment, the user expects the update to restart right away, but there is no longer a pending deployment for this device.
So, same question: do we have to monitor failures externally via the APIs and automatically relaunch a deployment for those devices, or is there a better approach?

Hi @vhubert, welcome to Mender Hub. I’m glad to hear you are successfully using the product.

As for your first issue, we don’t currently have a mechanism to handle automatic updates on first boot. We have discussed implementing something here but I don’t believe there is any specific delivery planned for it. @eystein do you have any more details you can share?

Regarding the second issue: yes, we deliberately do not retry updates. From Mender’s perspective, it’s not always possible to know why the deployment failed, so we err on the side of caution and require operator intervention in this case. I believe there are some plans for handling certain instances of this kind of issue when we can definitively determine why the installation was interrupted.

At the moment the API is the best approach for both of these issues.
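
As a rough illustration of the API approach for the first issue, here is a minimal Python sketch of the polling logic. It is not an official Mender tool: the helper names (`find_new_devices`, `build_deployment_request`, `poll_once`) are hypothetical, and the payload fields (`name`, `artifact_name`, `devices`) are an assumption about the deployments management API, so please check them against the API docs for your Mender version. Fetching the list of accepted device IDs and sending the request are deliberately left out.

```python
from typing import Dict, Iterable, List, Optional, Set


def find_new_devices(seen: Set[str], accepted: Iterable[str]) -> List[str]:
    """Return accepted device IDs that were not seen on earlier polls."""
    return sorted(set(accepted) - seen)


def build_deployment_request(artifact_name: str, device_ids: List[str]) -> Dict[str, object]:
    """Build a deployment body targeting only the given devices.

    The field names are an assumption about the deployments API; verify them.
    """
    return {
        "name": f"{artifact_name}-new-devices",
        "artifact_name": artifact_name,
        "devices": list(device_ids),
    }


def poll_once(
    seen: Set[str], accepted: Iterable[str], artifact_name: str
) -> Optional[Dict[str, object]]:
    """Run one polling cycle; return a request body, or None if nothing is new."""
    new_ids = find_new_devices(seen, accepted)
    if not new_ids:
        return None
    seen.update(new_ids)  # remember them so they are not targeted twice
    return build_deployment_request(artifact_name, new_ids)
```

For example, with `seen = {"a"}` and accepted devices `["a", "b"]`, `poll_once` returns a body targeting only `b`, and a second call with the same list returns `None`.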

Drew

Hi @vhubert,

Thanks for the feedback. Both of these are on the Mender roadmap, though likely as commercial features (Mender Professional and/or Enterprise) because they relate to larger-scale, automated deployment management. As Drew mentions, it is possible to build your own automation for this on top of Mender Open Source.

Regarding the second use case, there is a button in the UI that will recreate the deployment for failed devices. We’re also planning to automate this part. Are there specific conditions under which you think the deployment should be retried for a device, or should it always be retried when it fails? How many times do you think Mender should try?

Thanks again for your input and feedback.

Hi,

Thanks for the quick feedback. In that case, we will use the API to handle those cases automatically.

Regarding automatically retrying the update in some failure cases, we noticed at least one situation where it seems relevant (and, I suppose, not too hard to detect): when the device is restarted during the artifact download. The Mender client reports the error “got invalid entrypoint into the state machine: state: update-store” and indicates a failure. However, it would make sense for the server to simply treat this as an interruption and restart the deployment. It would be even better if the client could resume the download, as it does when the network is temporarily unavailable (thus saving bandwidth for the server and time for the client).
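
To make the detection idea concrete, an external script could classify a failed deployment by looking for that message in the device's deployment log. This is only a minimal Python sketch: it assumes the log is available as plain text, the function name is hypothetical, and the marker string is simply the error quoted above, which other client versions may word differently.

```python
# Marker copied from the client error quoted above (an assumption for other versions).
INTERRUPTED_MARKER = "got invalid entrypoint into the state machine: state: update-store"


def looks_like_interrupted_download(deployment_log: str) -> bool:
    """Return True if the log suggests the device restarted mid-download."""
    return INTERRUPTED_MARKER in deployment_log
```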

For the retry limit, something around 5 could make sense: we give it a few tries, but if it keeps failing, that probably indicates a situation in which the device cannot update properly and manual investigation is required.
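
The retry limit could be tracked on our side with something like the following Python sketch. It only covers the bookkeeping: `plan_retries` and the `attempts` dictionary are hypothetical names, and how the failed device IDs are obtained and how the new deployment is created are left to the API calls.

```python
from typing import Dict, Iterable, List, Tuple

MAX_RETRIES = 5  # the limit suggested above; adjust as needed


def plan_retries(
    failed_ids: Iterable[str],
    attempts: Dict[str, int],
    max_retries: int = MAX_RETRIES,
) -> Tuple[List[str], List[str]]:
    """Split failed devices into (retry now, needs manual investigation).

    `attempts` maps a device ID to the number of retries already made and is
    updated in place for every device scheduled for another retry.
    """
    retry: List[str] = []
    give_up: List[str] = []
    for device_id in failed_ids:
        used = attempts.get(device_id, 0)
        if used < max_retries:
            attempts[device_id] = used + 1
            retry.append(device_id)
        else:
            give_up.append(device_id)
    return retry, give_up
```

Devices in the first list would get a new deployment, while those in the second list have used up their retries and would be flagged for a human to look at.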