Robustness of deployments

We recently decided to try the paid Mender Professional plan for our company, and we are in the process of testing it for our needs. As we will use Mender on a Raspberry Pi 3 and its microSD card, one issue we want to fully cover is filesystem corruption, which is rare but still possible. So far, with a read-only filesystem and dual partitions, we have tackled this pretty well.
We use the latest Mender on a custom Yocto warrior build (we will need the diff updates in the end, but for now we are testing with rootfs-image).

After initiating a rootfs-image deployment from the cloud, we made the filesystem being deployed unbootable using 2 methods (see below). The results were the same in both cases, with 2 serious issues observed from our end.
The Pi rolls back successfully, but “mender show-artifact” reports the new artifact (which never got to boot) and commits that to the Mender Cloud server, changing the Deployment status to “SUCCESS”. This gives a totally wrong impression when the deployment is done remotely.

The 2 methods used are:

  1. While the cloud server pushes the download and the mender client writes it to /dev/mmcblk0p3, we run this to mess up the inactive partition (p3):
    dd if=/dev/random of=/dev/mmcblk0p3
  2. We built a custom artifact without /sbin/init, which should fail to boot.

We also tried the above on an unmodified core-image-base for MACHINE=raspberrypi3, with the same results.
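Method 1 can be rehearsed without touching a real partition. The sketch below is only an illustration and was not part of the original test: it damages a scratch file instead of /dev/mmcblk0p3, and the helper name `corrupt_start` is hypothetical. It reads from /dev/urandom rather than /dev/random, since the latter can block on older kernels when entropy runs low.

```shell
#!/bin/sh
# Hypothetical helper: overwrite the first N 512-byte blocks of a file
# (or block device) with random data, without changing its length.
corrupt_start() {
    target="$1"   # file or device to damage
    blocks="$2"   # number of 512-byte blocks to overwrite
    dd if=/dev/urandom of="$target" bs=512 count="$blocks" conv=notrunc 2>/dev/null
}

# Example on a scratch file standing in for the inactive partition:
#   dd if=/dev/zero of=scratch.img bs=512 count=8 2>/dev/null
#   sha256sum scratch.img       # hash before
#   corrupt_start scratch.img 2
#   sha256sum scratch.img       # hash after differs, size is unchanged
```

Comparing the hash before and after makes it easy to confirm that the damage actually landed before rebooting.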

From our point of view we see 2 potential problems here.

  1. The deployment is only hashed on download, before it is written to disk (p3), “assuming” that the write actually stores it intact. This assumption leaves a huge security hole if an attacker with access can modify the filesystem before the reboot (maybe to bypass an upcoming security fix update). Even on signed deployments!
  2. “mender show-artifact” reports a value that is not “realtime” but was stored beforehand, in the hope that it will still be valid when needed

Has anyone tried this case before?

Hi @bender

  1. The deployment is only hashed on download, before it is written to disk (p3), “assuming” that the write actually stores it intact. This assumption leaves a huge security hole if an attacker with access can modify the filesystem before the reboot (maybe to bypass an upcoming security fix update). Even on signed deployments!

Mender does not intend to protect against all attack scenarios. The Mender signature guarantees that the image that is downloaded and installed is properly signed, but we cannot guarantee that it hasn’t been changed at some later time. You would need some kind of Secure-Boot setup to protect against this kind of attack.

  2. “mender show-artifact” reports a value which is not “realtime” but stored beforehand hoping it will be valid when needed

This one is strange. I’ve been unable to reproduce it with either of the scenarios you mentioned. In all cases my device rolls back and shows the proper artifact name. @kacf @lluiscampos @oleorhagen do you guys have any suggestions on how to proceed with troubleshooting this?

Drew

With respect to point 2, I thought this couldn’t happen, as the metadata referenced by the mender show-artifact command was stored on the respective A/B partitions, unless this has changed recently?

How can I check this?

In older versions of the client I believe it was stored in /etc/mender/artifact_info; however, from other discussions I believe this has been deprecated in newer versions of the client in favour of an lmdb database. I don’t know where that is stored, as I have no devices on a new enough version of mender to be able to check this. Just speculating, but if the lmdb database is now stored on the /data partition, then that is where I would start looking, as it’s now a shared resource and you may have uncovered a bug.

Isn’t this in /data/mender/mender-store?
Although it is not meant to be human readable, I tried a simple cat on it and I could see the new artifact name before AND after the unsuccessful update.

Looks like you have found its location.

As to how it’s able to set the wrong values in the shared database under your testing conditions, I will have to defer to the mender-team, as I don’t use a version that has a database.

Is there a way to properly query this db, or simply parse all of its values?

I haven’t used these, but the source code for lmdb-utils, for working with a database, seems to be here

An attacker could just as easily do this while the filesystem was being read back to verify the checksum, so reading it back provides no extra security. At best it provides a slight protection against corruption due to hardware failure, but given the huge performance hit, I do not believe this is worthwhile to do.
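For readers weighing that trade-off, the kind of read-back check under discussion could be sketched as follows. Nothing in this thread says Mender does this; the function names are hypothetical, and the image size and expected hash are assumed to be known from the artifact. Only the first SIZE bytes are hashed, because the partition is normally larger than the image.

```shell
#!/bin/sh
# Hypothetical helper: SHA-256 of the first SIZE bytes of a file or device.
readback_sha256() {
    # $1 = file or device, $2 = number of bytes to hash
    head -c "$2" "$1" | sha256sum | cut -d' ' -f1
}

# Succeeds (exit 0) only if the data read back matches the expected hash.
verify_written_image() {
    dev="$1"; size="$2"; expected="$3"
    [ "$(readback_sha256 "$dev" "$size")" = "$expected" ]
}

# Usage (values are placeholders):
#   verify_written_image /dev/mmcblk0p3 "$IMAGE_SIZE" "$EXPECTED_SHA256" \
#       || echo "read-back mismatch" >&2
```

Such a check only reflects the state of the data at the moment it is read, and on Linux the read may be served from the page cache rather than the card, so it says nothing about modifications made afterwards.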

Yes, this is correct, the database stores the current name, and the upcoming name (how else would it remember what the upcoming name is going to be?). However mender show-artifact will always show the current name.

If mender show-artifact does not produce the correct value, then please share the exact steps to reproduce this.

Could you try that?

There is nothing Mender can do to prevent you from writing to the filesystem if you have root access. The only remedy I can think of here is to limit the number of processes that have root access to a minimum.

I cannot reproduce your issue, for me it rolls back just fine and reverts to the previous image. What messages do you get on the console when doing this?

And which Mender client version are you using?

Our stress test is mainly focusing on SD card corruption which is not very easy to emulate otherwise.

I am waiting for @drewmoseley to provide a rootfs image disk to flash and a test release to retry on my end

root@pi5798:~ # mender --version
2.3.0-dirty runtime: go1.12.9

As the /data partition is indeed shared, storing values in the db (/data/mender/mender-store) that should be specific to the active or inactive partition should not work unless they are explicitly tied to each partition, addressed directly (/dev/mmcblk0p2 or /dev/mmcblk0p3).
Could that be the root of the issue here?

@bender, since your tests with my image did not exhibit the issue, let’s see if we can figure out what your image is doing differently that would interfere with the rollback in this case.

Do you have any scripting that invokes the mender binary directly or that manipulates the systemd service for Mender? Can you share any state scripts that you have implemented?

Do you have any scripting in either U-Boot or Linux that manipulates the U-Boot environment? Can you share the details?

Drew

For others who may be able to contribute, the build I provided to @bender was straight from the Mender hub post. Configured to connect to hosted Mender. No other customizations, state scripts, etc.

Nope

No state scripts are implemented (yet).

No

Any advice is welcome as we don’t know where else to look to spot the difference.

At the moment I’m at a loss. Can you provide the serial console output from the board through the course of the deployment and the associated reboots?

Did you ever get around to dumping the contents of the lmdb database to inspect it?

I copied mender-store over to my machine and installed my distro’s ready-made package (I did not build from source).

$ mdb_load -f mender-store mender-test
mdb_load: line 1: unexpected format

Any hints? There was a mender-store.lock file present on the Pi too.
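One possible explanation, offered as a hint rather than a confirmed diagnosis: mdb_load imports a text dump produced by mdb_dump, so pointing it at the raw database file would plausibly give a format error like the one above. To read an existing database, mdb_dump is the tool to try. The sketch below assumes mender-store is a single-file LMDB environment (hence -n) with its data in the main database; both are assumptions about the Mender client, and the wrapper name `dump_mender_store` is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: print an LMDB file's main database in readable form.
dump_mender_store() {
    store="$1"
    if ! command -v mdb_dump >/dev/null 2>&1; then
        echo "mdb_dump not found (it ships with the LMDB utilities)" >&2
        return 127
    fi
    if [ ! -f "$store" ]; then
        echo "no such file: $store" >&2
        return 2
    fi
    # -n: the environment is a plain file, not a directory
    # -p: print keys and values as printable text where possible
    mdb_dump -n -p "$store"
}

# Usage, on a copy of the file taken from the device:
#   dump_mender_store ./mender-store
```

If the output is empty, `mdb_dump -n -l` lists any named sub-databases, which can then be dumped individually with `-s NAME`.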