Auto-commit for standalone mender updates

Hi Folks,
We are looking for feedback on some changes proposed for the meta-mender-tegra layer to help with some issues noticed in testing of one of the hardware variations. See https://github.com/OE4T/meta-mender-community/issues/7 and Failing updates on Tegra TX2 due to issues with `nvbootctrl` for background.

We’d like to reduce the chance that a user forgets to run mender -commit after startup by handling this automatically in a startup script, which will also handle some other mender-related startup tasks in the scenario described above.

The best way we’ve thought of so far is to add our own mender wrapper, which touches a marker file when an install succeeds. A one-shot service then checks for this marker on startup and runs mender commit automatically. We rename the mender utility to mender_override and put our script in its place, so users can just run mender -install as they normally would.
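
To make the idea concrete, here is a minimal sketch of the wrapper and the startup check, written in Python purely for illustration (the changes proposed in the PR may look quite different). The marker path, the location of mender_override and the function names are assumptions; the -install and -commit flags are the ones used in this thread.

```python
import subprocess
from pathlib import Path

# Hypothetical locations, chosen for illustration only.
MARKER = Path("/var/lib/mender/standalone-install-pending")
REAL_MENDER = "/usr/bin/mender_override"  # the renamed original mender binary


def run_real(args):
    """Run the renamed mender binary and return its exit code."""
    return subprocess.run([REAL_MENDER, *args]).returncode


def wrapper(args, marker=MARKER, run=run_real):
    """Forward the call; leave a marker if a standalone install succeeded."""
    rc = run(args)
    if rc == 0 and "-install" in args:
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()
    return rc


def commit_on_boot(marker=MARKER, run=run_real):
    """One-shot startup step: commit only if an install left a marker."""
    if not marker.exists():
        return 0  # nothing pending, nothing to do
    rc = run(["-commit"])
    if rc == 0:
        marker.unlink()  # only clear the marker once the commit succeeded
    return rc
```

The script installed in place of mender would call wrapper() with its command-line arguments, and the one-shot service would call commit_on_boot(). The sketch does not try to handle the case where the device has already rolled back before the startup step runs.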

Curious if there are better ways of doing this that we haven’t found yet, or if there are any plans or discussions about a more universal way of doing this that wouldn’t be specific to this one hardware platform.

Hi @dwalkes.

I also found that the description here helped me understand the problem.

I guess one of the problems might be this:

After 7 boot attempts without committing, the NVIDIA bootloader rolls back to the previous slot. However, since mender and u-boot are not synchronized, the mender rootfs slot still references the wrong slot.

I guess nvbootctrl has a retry counter of 7, while Mender only has bootlimit=1, and Mender would roll back in U-Boot after one reboot/reset.

So I am wondering if it would help to align the retry counters in both tools instead, and if that would be enough?

Thanks @mirzak

We are still trying to understand the problem, as you can see in the thread of that PR. However, we are likely limited in what we can do to control the NVIDIA bootloader logic, and the solution will likely include requirements about how mender -commit and other commands to NVIDIA’s bootloader tools align with each other.

The main thing I was looking for in this thread was feedback about the general idea of automating mender -commit for standalone installations, which may be useful for this specific issue but would also likely be useful for other cases. For instance, I’ve found myself forgetting to mender -commit and accidentally rolling back on other platforms, which can be confusing. It seems like the ideal scenario would be for the mender client to remember when a standalone install happened and, perhaps as an option, automagically commit for you on the next boot. This is essentially what we’ve done with the changes proposed in the PR. I’m just wondering if there’s some reason, philosophical or otherwise, why this isn’t done already or why it’s a bad idea to do it the way we are proposing.

I think there is no single truth in when exactly a deployment can be deemed successful.

Possible answers include:

  • Device is able to boot into userspace.
  • Device is able to reach a certain host on some network.
  • Device is able to read and store (but not necessarily transmit) location or environmental data.

I don’t have a specific opinion on this, but one possibility you might want to think about is to disable the marker-checking-and-mender-committing service by default. That way,

  • users can make a decision at build time
  • users don’t get this by default, that is, they actively have to think about the desired behavior.
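
As a rough illustration of the two points above (an opt-in switch plus user-chosen success criteria), a commit decision could be sketched like this in Python. All names here are hypothetical, and the TCP reachability check is just one example of a criterion, not something Mender provides.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_commit(enabled, checks):
    """Decide whether the startup service may auto-commit.

    enabled: the build-time opt-in (assumed to default to False).
    checks:  callables that return True when a success criterion is met.
    """
    if not enabled:
        return False
    # An empty list means "booting far enough to run this is success enough".
    return all(check() for check in checks)
```

A build that opts in could then pass something like [lambda: can_reach("example.invalid", 443)] as checks, with the host being whatever the user considers meaningful.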

I think that’s true; however, I believe that for a mender update deployed from the mender server, the commit happens once the connection to the mender server is established. So that does become a single truth for mender server based updates. Note that I’m still not completely sure how this works, see related conversation at this link.

If you don’t use mender server and are using mender standalone, most likely not for production, it arguably might be less important how you define “successful” and having some default might be helpful, especially for the tegra uboot case.

Fundamentally I agree with this approach. The issue is just that in the uboot tegra case you can get into confusing states that are difficult to diagnose and recover from, especially for someone new to the platform.

I’m wondering if we could use systemd boot-complete.target for this.
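
One way to experiment with that idea is to ask systemd whether the target has been reached before committing. The sketch below assumes systemctl is available and that boot-complete.target is actually pulled in on the image, which depends on the system configuration; the function names are made up for illustration.

```python
import subprocess


def systemctl_is_active(unit):
    """Return True if systemd reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def boot_completed(is_active=systemctl_is_active):
    """Treat a reached boot-complete.target as the signal that boot succeeded."""
    return is_active("boot-complete.target")
```

An alternative that avoids polling would be to order the one-shot commit service after boot-complete.target in its unit file.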