Getting "error: fw_setenv returned failure: exit status 2" after "Committing update" during an OTA update

Surprisingly, the issue above occurs when the hosted Mender server uses S3 bucket artifact storage, but not with MinIO artifact storage on EC2. Also, the first OTA update succeeds, but the second and subsequent OTA updates fail with the same image/artifacts.

Below is the sequence:

  • Flash the new image on the target device

  • Small OTA push, i.e. a single-file OTA update (~1 min)

  • Full OS Mender OTA update (~30 min) ==> deployment succeeds

  • Small OTA push again, i.e. a single-file OTA update (~1 min)

  • Full OS Mender OTA update again (~30 min) ==> deployment fails (see screenshot below)
    

Hey @Rohita83, sorry for the delay, busy times!

There are several things here which are a bit suspicious. Let’s look at the first issue to begin with:

Error rebooting device: signal: terminated

What the client expects is that the reboot command succeeds, and then at some point later, the client is killed. But it looks like what's happening here is that the command fails, but the reboot still happens. I suspect the reason is that the reboot command itself gets killed as part of the reboot, and never gets to report success. The client is not taking this scenario into account.

If you have a chance to patch the client, can you try to apply this patch:

diff --git a/system/system.go b/system/system.go
index a96d7d8..c676058 100644
--- a/system/system.go
+++ b/system/system.go
@@ -33,10 +33,7 @@ func NewSystemRebootCmd(command Commander) *SystemRebootCmd {
 }
 
 func (s *SystemRebootCmd) Reboot() error {
-       err := s.command.Command("reboot").Run()
-       if err != nil {
-               return err
-       }
+       s.command.Command("reboot").Run()
 
        // Wait up to ten minutes for reboot to kill the client, otherwise the
        // client may mistake a successful return code as "reboot is complete,

Note that it needs to be applied to the client that you upgrade from, not the one you upgrade to. So you may need to install it manually first, or flash the device from scratch.

@kacf Thanks for the informative response.

I am using the mender_convert v2.4.0 tool directly to build the Artifact, which includes Mender client v2.6.0.

To apply the above patch to mender_client, I think I need to modify the mender_convert file mender-convert-modify and create a deb package for it. Please let me know if I am wrong. Could you please help me build this deb package?

Hey @Rohita83, can you try this package?

It is a development build, so don’t use it in production. But let’s see if it fixes the problem.

@kacf

We are on Mender Client 2.6. The changelog has an entry dated April 2021, titled “mender-binary-delta (1.2.1)” (see screenshot below). It mentions a fix for an incorrect status during rollback when the bootloader is being used. Is this change part of the Client 2.6 release, or would I have to upgrade the server to 2.7 to get it? Is this change relevant to the issue I am experiencing?

I will also test the change that you mentioned above and will come back with the result.

https://docs.mender.io/2.7/release-information/release-notes-changelog

The mender-binary-delta version is independent of both the client and the server, it should work with either. It depends on which version you have installed from here.

We are switching to a new changelog structure in the development version of our docs, split into component pages. This avoids everything being lumped together like it is now, which is hopefully easier to understand and follow. Check it out here.

@kacf Thanks for the quick response.

Using the Mender deb package you provided, I modified the mender_convert utility to include this Mender client package when creating the Mender Artifact, as shown below:

Note: I renamed your client package to deb_name=mender-client_2.6.0-LOCAL_amd64.deb

Attached is the mender_convert log for reference:
convert.log.yml (26.3 KB)

After creating the Artifact (alto_p3_upgV2.6_20210601_Testserver_TEST.mender) and uploading it to the server GUI successfully, I triggered the OTA update twice, but unfortunately it failed. Log screenshots are attached below for reference as log-1.

I also ran a directory update OTA to make a change in the home partition, and then triggered the OTA again, but it failed again (see log-2 below for reference).

ARTIFACT_NAME="HOME_TEST_v1.0"
DEVICE_TYPE="x86_64"
OUTPUT_PATH="HOME_TEST_v1.0.mender"
DEST_DIR="/home/alto/TEST/"
FILE_TREE="Test_local"

./directory-artifact-gen -n "${ARTIFACT_NAME}" -t "${DEVICE_TYPE}" -d "${DEST_DIR}" -o "${OUTPUT_PATH}" "${FILE_TREE}"

Note: I have also seen that OTA deployment (over multiple attempts) is inconsistent with Mender v2.6 and S3 bucket artifact storage, per the observations below.

Waiting for your input to crack this issue.

@kacf
Additional information: earlier I was using Mender 2.6 with a self-hosted MinIO (EC2) storage configuration. OTA deployment took a long time (~2 h 30 min), but I did not experience the issue above. Once I switched to an AWS S3 storage configuration, OTA deployment time dropped to only ~30 min. However, the issue above now appears randomly: sometimes the deployment succeeds and sometimes it fails. I don't think this error is related to the S3 configuration; it is just my observation.

Hey @Rohita83,

Ok, so at least we got rid of that error. I’d really like to isolate the problem on the client side. Any chance you can copy the artifact onto the device, or to a memory stick, and install it using standalone mode? This would give us a chance to figure out exactly what is going on.

Assuming you have put the artifact in /data/artifact.mender, follow these steps for a manual install:

mender install /data/artifact.mender
reboot

# At this stage, maybe try "fw_printenv" to verify that the
# environment is working correctly.

mender commit

Does that work?

@kacf
I followed the steps above; the output is below:

$ fw_printenv # output before install
bootcount=1
mender_boot_part=2
upgrade_available=0
mender_uboot_separator=
mender_boot_part_hex=2

$ sudo mender install alto_p3_upgV2.6_20210529_Testserver_v2-x86_64-mender.mender 
INFO[0000] Loaded configuration file: /var/lib/mender/mender.conf 
INFO[0000] Loaded configuration file: /etc/mender/mender.conf 
INFO[0000] Mender running on partition: /dev/sda2       
INFO[0000] Start updating from local image file: [alto_p3_upgV2.6_20210529_Testserver_v2-x86_64-mender.mender] 
INFO[0000] No public key was provided for authenticating the artifact 
INFO[0000] Opening device "/dev/sda3" for writing       
INFO[0000] Native sector size of block device /dev/sda3 is 512 bytes. Mender will write in chunks of 1048576 bytes 
.............................................................. - 100 %
INFO[1483] All bytes were successfully written to the new partition 
INFO[1483] The optimized block-device writer wrote a total of 49729 frames, where 23694 frames did need to be rewritten (i.e., skipped) 
INFO[1485] Wrote 52143587328/52143587328 bytes to the inactive partition 
INFO[1485] Enabling partition with new image installed to be a boot candidate: 3 

$ fw_printenv # output after reboot
bootcount=1
mender_boot_part=3
upgrade_available=0
mender_uboot_separator=
mender_boot_part_hex=3

$ mender commit
INFO[0000] Loaded configuration file: /var/lib/mender/mender.conf
INFO[0000] Loaded configuration file: /etc/mender/mender.conf
INFO[0000] Mender running on partition: /dev/sda3
ERRO[0000] Could not commit Artifact: There is nothing to commit
WARN[0000] There is nothing to commit

Refer to: “Fixed wrong error produced by rootfs-image commit” by hacpa (Pull Request #607, mendersoftware/mender on GitHub)

In this scenario, what does “fw_printenv” show after the install but before the reboot?

Drew

What @drewmoseley said!

Also, what does mount show, both before and after the reboot?

@drewmoseley
“fw_printenv” output after the install but before the reboot:

$ fw_printenv
bootcount=0
mender_boot_part=3
upgrade_available=1
mender_uboot_separator=
mender_boot_part_hex=3
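A tiny helper makes it easier to compare the fw_printenv dumps. The Go sketch below is hypothetical (`parseFwPrintenv` and `pendingCommit` are illustrative names). It parses the `key=value` lines and reports whether an update still appears to be waiting for `mender commit`. It assumes, consistent with the outputs above, that `upgrade_available=1` marks an uncommitted update.

```go
package main

import (
	"bufio"
	"strings"
)

// parseFwPrintenv turns `fw_printenv` output ("key=value" per line) into a map.
// Lines without '=' are ignored. Only the first '=' splits key from value, so
// empty values such as "mender_uboot_separator=" are kept as "".
func parseFwPrintenv(output string) map[string]string {
	env := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			continue
		}
		env[parts[0]] = parts[1]
	}
	return env
}

// pendingCommit reports whether the environment still marks an update as
// waiting to be committed (assumption: upgrade_available=1 means pending).
func pendingCommit(env map[string]string) bool {
	return env["upgrade_available"] == "1"
}
```

Feed it the two dumps from this thread:

- Before the reboot, `pendingCommit` returns true.
- After the reboot, it returns false.

`mender_boot_part` is 3 in both cases. So the partition switch survived the reboot, but the pending-update flag did not. That would be consistent with the "There is nothing to commit" error.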

@kacf @drewmoseley

The mount output, both before and after the reboot, is in the attached file:
mount_log.yml (16.8 KB)

$ sudo mender commit

INFO[0000] Loaded configuration file: /var/lib/mender/mender.conf 
INFO[0000] Loaded configuration file: /etc/mender/mender.conf 
INFO[0000] Mender running on partition: /dev/sda2       
ERRO[0000] Could not commit Artifact: There is nothing to commit 
WARN[0000] There is nothing to commit

Logs from a second attempt, with a freshly flashed image on the target device, for reference:
mount-log-attempt-2.yml (12.2 KB)

Could you please suggest how to crack this issue?

I have to admit that I’m starting to run out of ideas. Something is modifying the boot loader environment behind your back, it’s definitely not behaving as it should. Are you using any state scripts, or modifying the sda1 partition in any way?

@kacf
Thanks for the quick response.

I am not using state scripts (see screenshot below).

I used the directory update module (directory-artifact-gen), as below, to modify the home partition and create a TEST folder.

ARTIFACT_NAME="HOME_TEST_v1.0"
DEVICE_TYPE="x86_64"
OUTPUT_PATH="HOME_TEST_v1.0.mender"
DEST_DIR="/home/alto/TEST/"
FILE_TREE="Test_local"

./directory-artifact-gen -n "${ARTIFACT_NAME}" -t "${DEVICE_TYPE}" -d "${DEST_DIR}" -o "${OUTPUT_PATH}" "${FILE_TREE}"

Note: I have attached the grub.cfg file here for reference, in case it gives any clue.
grub.cfg.yml (7.8 KB)

How are you generating the initial UEFIIMG? I know you are using mender-convert from the earlier discussion but where are you getting the input image? Is it an image that already has Mender installed?

Drew

@drewmoseley
I am using the latest mender_convert v2.4 here. The input image is a custom Ubuntu 18.04 image that I have used for the last ~1 year without any issues.
Mender is not installed in the input image; it is installed only by the mender_convert utility.

@drewmoseley @kacf @mirzak

Could anyone please explain why “fw_setenv” sometimes fails and sometimes works on x86 Ubuntu 18.04 (error logs below)? This happens when the self-hosted server uses AWS S3 bucket storage instead of MinIO, in both v2.6 and v2.7 (see screenshots below). It affects both full OS filesystem updates and application updates made with the directory-artifact-gen and single-file-artifact-gen scripts.

2021-06-08 15:40:29 +0000 UTC info: Committing update
2021-06-08 15:40:30 +0000 UTC error: fw_setenv returned failure: exit status 2
2021-06-08 15:40:30 +0000 UTC error: Failed to probe the U-Boot environment for which separator to use. Got error: exit status 2
2021-06-08 15:40:30 +0000 UTC error: transient error: update commit failed: exit status 2
2021-06-08 15:40:30 +0000 UTC info: State transition: update-commit [ArtifactCommit_Enter] → rollback [ArtifactRollback]
2021-06-08 15:40:30 +0000 UTC info: Performing rollback
2021-06-08 15:40:31 +0000 UTC info: Rolling back to the inactive partition (3).
2021-06-08 15:40:32 +0000 UTC error: fw_setenv returned failure: exit status 1
2021-06-08 15:40:32 +0000 UTC error: Failed to probe the U-Boot environment for which separator to use. Got error: exit status 1
2021-06-08 15:40:32 +0000 UTC error: Rollback failed: exit status 1
2021-06-08 15:40:32 +0000 UTC error: fatal error: exit status 1


@Rohita83: I’m working on making it easier to get information about failures like this by logging output from child processes better. See this pull request. Perhaps you can retry with a development build once that has been merged.

An exit status of 1 by itself does not give much information.