High Rate of Debian Failures

Hello there, I have been updating my BeagleBone via Debian updates on Mender Hosted Demo, and I’m seeing a really high rate of failure for Debian updates. Tagging @tranchitella because this is a repost from a support thread.

An example log of the common failure(s) can be found in the attached text file (renamed to yaml so I could upload it, sorry!).
deb_failure_bbb.yaml (6.3 KB)

The highlights are:
1. It seems the device rebooted in ArtifactCommit, but this is unrelated to your modified Update Module, since it doesn’t contain that state
a. I do have the NeedsArtifactReboot state/option set to Automatic, and it should reboot once the Debian package is installed. Looking at the timestamps, and knowing that our gateway takes about thirty seconds to a minute to reboot, I think it is rebooting when I would have expected here – is that an incorrect assumption about when the reboot should occur? Looking at the logs, is it possible that Mender is just handling the expected reboot poorly on startup? I’ve provided the logs I think are interesting below. At 14:55:04 it definitely begins the reboot process, and then a little less than thirty seconds later, at 14:55:32, the client starts back up again; everything after the startup message is within the same second. Is there something we can add to the ArtifactCommit state, like a sync or a sleep, that might help?

2021-05-20 14:55:04 +0000 UTC info: State transition: update-install [ArtifactInstall] -> reboot [ArtifactReboot_Enter]

2021-05-20 14:55:04 +0000 UTC info: Rebooting device(s)

2021-05-20 14:55:32 +0000 UTC info: Running Mender client version: 2.3.1

2021-05-20 14:55:32 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]

2021-05-20 14:55:32 +0000 UTC info: Running Mender client version: 2.3.1

2021-05-20 14:55:32 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> after-reboot [ArtifactReboot_Leave]

2021-05-20 14:55:32 +0000 UTC error: Mender shut down in state: after-reboot

2021-05-20 14:55:32 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> update-commit [ArtifactCommit_Enter]

2. According to the logs, the deb update module is not available anymore; are you shipping it as part of your package?
a. No, the only Mender file we edit on the gateway with our module updates is a Mender inventory file – more details on what we do touch are in item 4. Is it possible that Mender is misconfigured and the deb failure is a waterfall effect of a different missing dependency?
3. We can also see networking issues in contacting the server, and this makes it hard to understand what triggered what.
a. Are these networking issues server side? I noticed that the errors mention ‘server misbehaving’. This gateway is connected directly to my router and hasn’t had other networking issues that I’m aware of, however if there is a connectivity check you’d like me to run to verify, I have no problem doing that.
4. Can you please provide us more information about the content of the package you are installing?
a. Sure thing; we are installing our own binary that runs as a service on the gateway. I’ll add the list of files/directories we touch for your reference:
i. /usr/local/bin – this is where we house the binaries called by our service
ii. /usr/local/env – we hold some files here for configuration use
iii. /usr/lib/systemd/system – we modify our service file here
iv. /etc/artis – our own directory for data, this is symlinked to the persistent /data partition
v. /usr/local/man/man1 – we add our documentation here
vi. /etc/dhcp/dhclient-enter-hooks.d – for ethernet configurations
vii. /etc/init.d – custom initialization script lives here
viii. /usr/share/mender/inventory – add a custom inventory file that holds data we send to Mender

After pulling together the above responses, I also monitored an update via serial terminal. As soon as I was able to log in, I checked whether the file /usr/share/mender/modules/v3/deb was there, and it was – yet shortly after, the update failed saying it doesn’t exist. I did notice that the error doesn’t include a state to check when calling the file; is it possible there is a file of states getting checked during the process that is missing instead?

Edit: forgot to post the current deb file contents:

ubuntu@artis-mender:~$ cat /usr/share/mender/modules/v3/deb 

#!/bin/sh
# deb Update Module (Update Module v3 state interface)
set -e

STATE="$1"
FILES="$2"
case "$STATE" in
    ArtifactInstall)
        # non-interactive install of every .deb in the payload
        yes | dpkg -i "$FILES"/files/*.deb
        ;;
    NeedsArtifactReboot)
        # tell the client to reboot automatically after install
        echo "Automatic"
        ;;
esac
exit 0
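For reference, this is the kind of ArtifactCommit addition I had in mind – a sketch only, assuming the standard Update Module v3 state names, with the install logic unchanged:

```shell
#!/bin/sh
# Sketch only: the existing deb module plus a hypothetical
# ArtifactCommit handler that flushes filesystem buffers, so a
# reboot right after commit cannot lose pending writes.

handle_state() {
    case "$1" in
        ArtifactInstall)
            # non-interactive install of all payload .deb files
            yes | dpkg -i "$2"/files/*.deb
            ;;
        ArtifactCommit)
            # flush pending writes before the client persists
            # the committed state
            sync
            ;;
        NeedsArtifactReboot)
            echo "Automatic"
            ;;
    esac
}

handle_state "${1:-}" "${2:-}"
```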

Tagging others for feedback: @oleorhagen @kacf @lluiscampos


It is indeed difficult to understand what is going on here, but several things point towards the filesystem(s) being incorrectly mounted somehow. The first reason I say this is the missing /usr/share/mender/modules/v3/deb, which can’t be explained by the package you are installing, it seems. Second, and perhaps even more importantly, at the end of the run /var/lib/mender/modules/v3/payloads/0000/tree is gone. This should never happen under any circumstances, so I suspect something is not mounted correctly.

A couple of questions following from that:

  1. What is /var/lib/mender in your setup, a directory or a symlink? And if it is a symlink, where does it point?
  2. Are you otherwise using any “tricks”, like symlinking the client database to a different location?
  3. As a debugging measure, can you add a call to mount at the top of the Update Module? This should give us a clear overview of mounted filesystems at each step.
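For instance, a small helper at the top of the module could look like this – just a sketch, and the log path is only an example:

```shell
#!/bin/sh
# Debugging sketch: append the current mount table to a log file at
# every Update Module state, so we can see whether filesystems
# change mid-update.

log_mounts() {
    state="$1"
    logfile="${2:-/tmp/mender-mounts.log}"
    {
        echo "=== state: $state ==="
        mount
    } >> "$logfile"
}

# Call it with the state the module was invoked with, e.g.:
# log_mounts "$1"
```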

Thanks for the response,

  1. Yes, /var/lib/mender is symlinked like so:

lrwxrwxrwx 1 root root 12 Feb 6 2019 /var/lib/mender → /data/mender

  2. I’m thinking by the client database you mean /data; let me know if the following information is what you were looking for. We are also symlinking /etc/artis -> /data/artis/, as well as copying a shadow file and a hostname file to /etc.
  3. I’m attaching a printout of the mount command for each state in a failed scenario. (Deployment ID is 56cd369d-4b93-4c3f-89d4-a21d0c32aa8b):
    results.yaml (15.0 KB)

Thanks, everything you posted seems totally correct, and none of the extra modifications should have any effect on the outcome as far as I can see.

Are you using systemd? Does /lib/systemd/system/mender-client.service.wants/mender-client-data-dir.service exist?

Other than that I’m starting to run out of ideas. If you do figure out what it is, please post it, I’m quite curious as to what could cause this!

We are indeed using systemd – the output of looking in that directory is as follows:

ubuntu@artis-mender:~$ ls /lib/systemd/system/mender-client.*
/lib/systemd/system/mender-client.service

This indicates to me that we don’t have that directory, nor the respective service that should exist there. Is this something we need to add manually?

Can you try adding data.mount to both the Wants= and the After= sections in /lib/systemd/system/mender-client.service? Like this:

[Unit]
Description=Mender OTA update service
Wants=network-online.target data.mount
After=systemd-resolved.service network-online.target data.mount

...

Make sure this is done both in the image you are upgrading from, and the one you are upgrading to.
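A quick way to confirm the edit took hold on each image could be something like this – a sketch only, using the unit path from this thread; remember that systemd also needs a daemon-reload after editing the file:

```shell
#!/bin/sh
# Sketch: verify that data.mount appears in both Wants= and After=
# of the Mender unit file. Run `systemctl daemon-reload` after
# editing so systemd picks up the change.

check_unit() {
    grep -q '^Wants=.*data\.mount' "$1" 2>/dev/null &&
    grep -q '^After=.*data\.mount' "$1" 2>/dev/null
}

if check_unit /lib/systemd/system/mender-client.service; then
    echo "data.mount dependency present"
else
    echo "data.mount dependency missing or unit not found"
fi
```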

I added it to /lib/systemd/system/mender-client.service and wasn’t able to see a single success in 4 or 5 retries. I noticed that there is also /lib/systemd/system/mender.service and updated it there as well. It succeeded twice but then gave the missing deb error again. Should I have both systemd service files?

Also, I still don’t seem to have the /lib/systemd/system/mender-client.service.wants/ directory, although I thought this might have been generated by adding the above.

Final question, to clarify “Make sure this is done both in the image you are upgrading from, and the one you are upgrading to” – if I’m doing a module update, I just need the modifications done in the active partition, correct?

I’m pretty sure the mender.service file got renamed to mender-client.service over time, so I’m not sure why you are seeing both files installed.


This is very interesting, it could mean that two Mender processes are running at the same time on your system. And they won’t know about each other because they have different names. If you run pgrep -l mender, does it list more than one process?

Try removing the /lib/systemd/system/mender.service file, this is the old name which should not be used anymore.
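A quick way to spot the duplicate-unit situation, as a sketch (paths as discussed above):

```shell
#!/bin/sh
# Sketch: count how many of the given Mender unit files exist; more
# than one means the old mender.service was left behind alongside
# mender-client.service and should be removed.

count_units() {
    n=0
    for f in "$@"; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

n=$(count_units /lib/systemd/system/mender.service \
                /lib/systemd/system/mender-client.service)
if [ "$n" -gt 1 ]; then
    echo "duplicate Mender units found: remove mender.service"
fi
```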


Yep, this fixed it! I’m currently at 6 for 6 with successful Debian updates. I think what happened is that we were on 1.7.0, and then when the Update Modules feature came out, we updated to 2.2.0 and didn’t know that things were renamed and there was an old service file to remove. Thanks for your help!

Edit: I’d also like to note that this solved some other issues we were seeing as well: dpkg locks and the “server misbehaving” errors.


I’m happy that it worked out!

FYI, I’m looking into ways to prevent this from happening again.


Wonderful! Thanks a lot again!

@lizziemac Good that you figured it out!

Can you share how you ended up in this situation? What was the process of updating from 1.7.0 to 2.2.2? Are you using Yocto, Mender’s deb package, or Debian’s deb package?

Hi @lluiscampos,
We were using the .deb packages provided via https://docs.mender.io – I went back just now to Downloads | Mender documentation, and it looks like the 2.2.0 we would have downloaded from that location has been patched. Since 2.2.0, we have also updated most of our clients to 2.3.1 using the same method. We did not create the original image with Yocto.

@lizziemac Mmm. IIRC the deb package was introduced in Mender 2.1; at the time of 1.7.0 we did not provide a deb package. Do you know how the initial integration with Mender was done back then?

Sorry for insisting, but I am trying to find more about your update path so that we can better prevent this from happening to other users.

Oh my mistake - I wasn’t the one to integrate it so I just assumed. Looking at the notes from the dev that did it (and has since left), I think we used the mender-convert tool and then made some u-boot changes; unfortunately, if that was the case, I’m not sure I could tell you which release branch was used at the time.