Unexpected rollback updating device

Hi all,

We’re getting a rollback too frequently for it to seem correct when updating devices. Given the time between the messages in the quoted log, I’d have to say I’m pretty sure the network is up, especially since it had no problem returning the failure log to the server. Does anybody have any ideas why this might be happening. We’re using the Mender 2.0.1 server and a 1.7.1 client. I’ve attached a log from our failed update.

Blockquote

2019-09-12 12:48:38 +0000 UTC info: Running Mender version 2.0.0
2019-09-12 12:48:38 +0000 UTC debug: handle update fetch state
2019-09-12 12:48:38 +0000 UTC debug: status reported, response 204 No Content
2019-09-12 12:48:38 +0000 UTC debug: Received fetch update response &{200 OK 200 HTTP/1.1 1 1 map[Server:[openresty/1.13.6.2] Last-Modified:[Thu, 12 Sep 2019 12:44:19 GMT] X-Xss-Protection:[1; mode=block] Strict-Transport-Security:[max-age=63072000; includeSubdomains; preload] Content-Length:[149346304] Accept-Ranges:[bytes] Etag:["7fd0983e60dff2c7d83dec84094bb85d"] Vary:[Origin] X-Content-Type-Options:[nosniff] Date:[Thu, 12 Sep 2019 12:48:38 GMT] Content-Security-Policy:[block-all-mixed-content] X-Amz-Request-Id:[15C3B1BC7A07C5E4] X-Frame-Options:[DENY] Content-Type:[application/vnd.mender-artifact] Connection:[keep-alive]] 0xc00033a820 149346304 [] false false map[] 0xc0004a6400 0xc0001f6630}+
2019-09-12 12:48:38 +0000 UTC info: State transition: update-fetch [Download_Enter] -> update-store [Download_Enter]
2019-09-12 12:48:38 +0000 UTC debug: handle update install state
2019-09-12 12:48:38 +0000 UTC debug: status reported, response 204 No Content
2019-09-12 12:48:38 +0000 UTC debug: Read data from device manifest file: device_type=intel-corei7-64
2019-09-12 12:48:38 +0000 UTC debug: Current manifest data: intel-corei7-64
2019-09-12 12:48:38 +0000 UTC info: Update Module path "/usr/share/mender/modules/v3" could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available
2019-09-12 12:48:38 +0000 UTC debug: checking if device [intel-corei7-64] is on compatibile device list: [intel-corei7-64]
2019-09-12 12:48:38 +0000 UTC debug: installer: processing script: ArtifactReboot_Enter_90_symlink-hostname
2019-09-12 12:48:38 +0000 UTC debug: installer: processing script: ArtifactReboot_Leave_00_fix-hostname
2019-09-12 12:48:38 +0000 UTC debug: installer: processing script: ArtifactReboot_Leave_10_flash-poweruc
2019-09-12 12:48:38 +0000 UTC debug: installer: processing script: ArtifactReboot_Leave_20_update-camera
2019-09-12 12:48:38 +0000 UTC debug: installer: processing script: ArtifactReboot_Leave_80_start-exec
2019-09-12 12:48:38 +0000 UTC debug: installer: successfully read artifact [name: test-image; version: 2; compatible devices: [intel-corei7-64]]
2019-09-12 12:48:38 +0000 UTC debug: Trying to install update of size: 4110417920
2019-09-12 12:48:38 +0000 UTC debug: Active partition: /dev/sda3
2019-09-12 12:48:38 +0000 UTC debug: Detected inactive partition /dev/sda2, based on active partition /dev/sda3
2019-09-12 12:48:38 +0000 UTC info: native sector size of block device /dev/sda2 is 512, we will write in chunks of 1048576
2019-09-12 12:48:38 +0000 UTC info: opening device /dev/sda2 for writing
2019-09-12 12:48:38 +0000 UTC info: partition /dev/sda2 size: 4110417920
2019-09-12 12:52:15 +0000 UTC info: wrote 4110417920/4110417920 bytes of update to device /dev/sda2
2019-09-12 12:52:16 +0000 UTC debug: status reported, response 204 No Content
2019-09-12 12:52:16 +0000 UTC info: State transition: update-store [Download_Enter] -> update-after-store [Download_Leave]
2019-09-12 12:52:16 +0000 UTC info: State transition: update-after-store [Download_Leave] -> update-install [ArtifactInstall]
2019-09-12 12:52:17 +0000 UTC debug: status reported, response 204 No Content
2019-09-12 12:52:17 +0000 UTC debug: Inactive partition: /dev/sda2
2019-09-12 12:52:17 +0000 UTC debug: Marking inactive partition (/dev/sda2) as the new boot candidate.
2019-09-12 12:52:17 +0000 UTC info: Enabling partition with new image installed to be a boot candidate: 2
2019-09-12 12:52:17 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=
2019-09-12 12:52:17 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:]
2019-09-12 12:52:17 +0000 UTC debug: Marking inactive partition as a boot candidate successful.
2019-09-12 12:52:17 +0000 UTC info: State transition: update-install [ArtifactInstall] -> reboot [ArtifactReboot_Enter]
2019-09-12 12:52:17 +0000 UTC debug: start executing script: ArtifactReboot_Enter_90_symlink-hostname
2019-09-12 12:52:17 +0000 UTC debug: status reported, response 204 No Content
2019-09-12 12:52:17 +0000 UTC debug: status reported, response 204 No Content
2019-09-12 12:52:17 +0000 UTC debug: handling reboot state
2019-09-12 12:52:17 +0000 UTC debug: status reported, response 204 No Content
2019-09-12 12:52:17 +0000 UTC info: rebooting device(s)
2019-09-12 12:52:17 +0000 UTC info: Mender rebooting from active partition: /dev/sda3
2019-09-12 12:52:38 +0000 UTC info: Running Mender version 2.0.0
2019-09-12 12:52:38 +0000 UTC info: Update Module path "/usr/share/mender/modules/v3" could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available
2019-09-12 12:52:38 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]
2019-09-12 12:52:38 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=
2019-09-12 12:52:38 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:]
2019-09-12 12:52:38 +0000 UTC debug: Have U-Boot variable: upgrade_available=1
2019-09-12 12:52:38 +0000 UTC debug: List of U-Boot variables:map[upgrade_available:1]
2019-09-12 12:52:38 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> after-reboot [ArtifactReboot_Leave]
2019-09-12 12:52:38 +0000 UTC debug: handling state after reboot
2019-09-12 12:52:38 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> update-commit [ArtifactCommit_Enter]
2019-09-12 12:52:38 +0000 UTC debug: start executing script: ArtifactReboot_Leave_00_fix-hostname
2019-09-12 12:52:38 +0000 UTC error: failed to report status: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: dial tcp: lookup [menderserver.no](http://menderserver.no/) on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: network is unreachable
2019-09-12 12:52:38 +0000 UTC error: statescript: can not send start status to server: reporting status failed: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: dial tcp: lookup [menderserver.no](http://menderserver.no/) on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: network is unreachable
2019-09-12 12:52:38 +0000 UTC debug: start executing script: ArtifactReboot_Leave_10_flash-poweruc
2019-09-12 12:52:38 +0000 UTC error: failed to report status: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: dial tcp: lookup [menderserver.no](http://menderserver.no/) on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: network is unreachable
2019-09-12 12:52:38 +0000 UTC error: statescript: can not send start status to server: reporting status failed: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: dial tcp: lookup [menderserver.no](http://menderserver.no/) on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: network is unreachable
2019-09-12 12:52:48 +0000 UTC debug: start executing script: ArtifactReboot_Leave_20_update-camera
2019-09-12 12:54:17 +0000 UTC error: failed to report status: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: EOF
2019-09-12 12:54:17 +0000 UTC error: statescript: can not send start status to server: reporting status failed: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: EOF
2019-09-12 12:55:21 +0000 UTC info: Running Mender version 2.0.0
2019-09-12 12:55:21 +0000 UTC error: Update was interrupted in state: after-reboot
2019-09-12 12:55:21 +0000 UTC info: Update Module path "/usr/share/mender/modules/v3" could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available
2019-09-12 12:55:21 +0000 UTC info: State transition: init [none] -> rollback [ArtifactRollback]
2019-09-12 12:55:21 +0000 UTC debug: transitioning to error state
2019-09-12 12:55:21 +0000 UTC info: performing rollback
2019-09-12 12:55:21 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=
2019-09-12 12:55:21 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:]
2019-09-12 12:55:21 +0000 UTC debug: Have U-Boot variable: upgrade_available=0
2019-09-12 12:55:21 +0000 UTC debug: List of U-Boot variables:map[upgrade_available:0]
2019-09-12 12:55:21 +0000 UTC debug: will try to rollback reboot the device
2019-09-12 12:55:21 +0000 UTC info: State transition: rollback [ArtifactRollback] -> rollback-reboot [ArtifactRollbackReboot_Enter]
2019-09-12 12:55:21 +0000 UTC info: rebooting device(s) after rollback
2019-09-12 12:55:21 +0000 UTC info: Mender rebooting from inactive partition: /dev/sda3
2019-09-12 12:57:10 +0000 UTC info: Running Mender version 2.0.0
2019-09-12 12:57:10 +0000 UTC info: Update was interrupted in state: rollback-reboot
2019-09-12 12:57:10 +0000 UTC info: Update Module path "/usr/share/mender/modules/v3" could not be opened (open /usr/share/mender/modules/v3: no such file or directory). Update modules will not be available
2019-09-12 12:57:10 +0000 UTC info: State transition: init [none] -> verify-rollback-reboot [ArtifactRollbackReboot_Leave]
2019-09-12 12:57:10 +0000 UTC debug: transitioning to error state
2019-09-12 12:57:10 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=
2019-09-12 12:57:10 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:]
2019-09-12 12:57:10 +0000 UTC debug: Have U-Boot variable: upgrade_available=0
2019-09-12 12:57:10 +0000 UTC debug: List of U-Boot variables:map[upgrade_available:0]
2019-09-12 12:57:10 +0000 UTC info: State transition: verify-rollback-reboot [ArtifactRollbackReboot_Leave] -> after-rollback-reboot [ArtifactRollbackReboot_Leave]
2019-09-12 12:57:10 +0000 UTC debug: handling state after rollback reboot
2019-09-12 12:57:10 +0000 UTC info: State transition: after-rollback-reboot [ArtifactRollbackReboot_Leave] -> update-error [ArtifactFailure]
2019-09-12 12:57:10 +0000 UTC debug: handle update error state
2019-09-12 12:57:10 +0000 UTC info: State transition: update-error [ArtifactFailure] -> cleanup [Error]
2019-09-12 12:57:10 +0000 UTC debug: Handling Cleanup state
2019-09-12 12:57:10 +0000 UTC info: State transition: cleanup [Error] -> update-status-report [none]
2019-09-12 12:57:10 +0000 UTC debug: handle update status report state
2019-09-12 12:57:10 +0000 UTC error: failed to report status: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: dial tcp: lookup [menderserver.no](http://menderserver.no/) on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: network is unreachable
2019-09-12 12:57:10 +0000 UTC error: error reporting update status: reporting status failed: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: dial tcp: lookup [menderserver.no](http://menderserver.no/) on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: network is unreachable
2019-09-12 12:57:10 +0000 UTC error: failed to send status to server: transient error: reporting status failed: Put https://menderserver.no/api/devices/v1/deployments/device/deployments/21db410f-e38b-4b49-a783-1c0bd3a91633/status: dial tcp: lookup [menderserver.no](http://menderserver.no/) on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: network is unreachable
2019-09-12 12:57:10 +0000 UTC info: State transition: update-status-report [none] -> update-retry-report [none]
2019-09-12 13:02:10 +0000 UTC debug: wait complete
2019-09-12 13:02:10 +0000 UTC info: State transition: update-retry-report [none] -> update-status-report [none]
2019-09-12 13:02:10 +0000 UTC debug: handle update status report state
2019-09-12 13:02:10 +0000 UTC debug: status reported, response 204 No Content
2019-09-12 13:02:10 +0000 UTC debug: attempting to upload deployment logs for failed update

Hi @ckelloug2 from this line, it looks like something is not happy in your DNS resolution setup in this case. It may be instructive to add some debug into your ArtifactReboot state scripts to dump status of connman or whatever network manager you are using. Output of “ifconfig” or “ip addr” as well as the contents of /etc/resolv.conf would also be useful.

You will probably need to dump that info into the /data partition and then view it after the rollback from an ssh connection to the device.

Drew

@drewmoseley @mirzak @eystein: Even though I use rpi3, I just hit the same issue where the device decides to rollback even though everything is correct and for some reason my After and Before scripts aren’t doing what they should (sometimes they work and I’m 100% sure they are correct).

2020-04-15 14:20:54 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
2020-04-15 14:20:54 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
2020-04-15 14:20:54 +0000 UTC debug: Have U-Boot variable: mender_saveenv_canary=1
2020-04-15 14:20:54 +0000 UTC debug: List of U-Boot variables:map[mender_saveenv_canary:1]
2020-04-15 14:20:54 +0000 UTC debug: Marking inactive partition as a boot candidate successful.
2020-04-15 14:20:54 +0000 UTC info: State transition: update-install [ArtifactInstall] -> reboot [ArtifactReboot_Enter]
2020-04-15 14:20:54 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-04-15 14:20:54 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-04-15 14:20:54 +0000 UTC debug: handling reboot state
2020-04-15 14:20:55 +0000 UTC debug: status reported, response 204 No Content
2020-04-15 14:20:55 +0000 UTC info: rebooting device(s)
2020-04-15 14:20:55 +0000 UTC info: Mender rebooting from active partition: /dev/mmcblk0p2
2020-04-15 11:35:33 +0000 UTC info: Running Mender version 2.2.0
2020-04-15 11:35:33 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]
2020-04-15 11:35:33 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
2020-04-15 11:35:33 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
2020-04-15 11:35:33 +0000 UTC debug: Have U-Boot variable: mender_saveenv_canary=1
2020-04-15 11:35:33 +0000 UTC debug: List of U-Boot variables:map[mender_saveenv_canary:1]
2020-04-15 11:35:33 +0000 UTC debug: Have U-Boot variable: upgrade_available=1
2020-04-15 11:35:33 +0000 UTC debug: List of U-Boot variables:map[upgrade_available:1]
2020-04-15 11:35:33 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> after-reboot [ArtifactReboot_Leave]
2020-04-15 11:35:33 +0000 UTC debug: handling state after reboot
2020-04-15 11:35:33 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> update-commit [ArtifactCommit_Enter]
2020-04-15 11:35:33 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-04-15 11:35:33 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-04-15 11:35:33 +0000 UTC debug: handle update commit state
2020-04-15 11:35:33 +0000 UTC error: failed to report status: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/de2f91d2-0574-4165-8a4b-616809d23370/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-15 11:35:33 +0000 UTC error: error reporting update status: reporting status failed: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/de2f91d2-0574-4165-8a4b-616809d23370/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-15 11:35:33 +0000 UTC error: Failed to send status report to server: transient error: reporting status failed: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/de2f91d2-0574-4165-8a4b-616809d23370/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-15 11:35:33 +0000 UTC info: State transition: update-commit [ArtifactCommit_Enter] -> update-pre-commit-status-report-retry [ArtifactCommit_Enter]
2020-04-15 14:21:07 +0000 UTC info: Running Mender version 2.2.0
2020-04-15 14:21:07 +0000 UTC error: Update was interrupted in state: update-commit
2020-04-15 14:21:07 +0000 UTC info: State transition: init [none] -> rollback [ArtifactRollback]
2020-04-15 14:21:07 +0000 UTC debug: transitioning to error state
2020-04-15 14:21:07 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-04-15 14:21:07 +0000 UTC info: performing rollback

So the issue here is that at the state State transition: after-reboot [ArtifactReboot_Leave] -> update-commit [ArtifactCommit_Enter] the device tries to contact the API and fails because the network isn’t ready. I have scripts handling that one.

cat rootfs_overlay/./etc/mender/scripts/ArtifactCommit_Enter_00_Wait_For_Network
#!/bin/sh
# Wait until the network is connected and the time is synced
# before syncing with the server

RETRY_LATER=21
OK=0

ping 8.8.8.8 -c1 > /dev/null 2>&1
if [ $? != 0 ]; then
    exit ${RETRY_LATER}
fi

TIME_SYNC_STATUS=$(timedatectl | grep NTP | awk '{print $NF}')
if [ "${TIME_SYNC_STATUS}" != "yes" ] && [ "${TIME_SYNC_STATUS}" != "active" ]; then
    exit ${RETRY_LATER}
fi

exit ${OK}

To me it seems like the script isn’t fired. How can I debug that?

Btw, why not make the behaviour where the device actually checks for the network before trying to PUT to the API the default one? This feels cumbersome, to be honest.

The Artifact* scripts should not go into the target filesystem. They are actually added into the artifact itself using the mender-artifact command. See https://docs.mender.io/2.3/artifacts/state-scripts#root-file-system-and-artifact-scripts for more details.

In this case you will need to adapt the call to mender-artifact in configs/mender_convert_config to add your state script. This can be done with the “-s” option:

–script value, -s value Full path to the state script(s). You can specify multiple scripts providing this parameter multiple times.

@mirzak is there a preferred way to do whit with mender-convert?

Yeah, I would like to have a definite way to install all those needed scripts via mender-convert, since I don’t use mender-artifact at all. I just have a golden image I’m updating and using mender-convert on it to create artifacts and full images.

I tried adding Artifact files now with mender-artifact. Used that artifact file to update and it still failed.

mender-artifact cp /path/to/some/file /path/to/artifact:/etc/mender/scripts/Artifact<some_name>

Here is the entire log of an upgrade, rollback and a failed api probe:
https://pastebin.com/raw/Ft6tikdw

Using “cp” is not the right approach. You need to modify the mender_convert_config to add the ‘–script value’ option. The files are packaged in the metadata of the artifact and not in the actual filesystem contents.

Drew

@drewmoseley, true, thanks for explaining!

Seems like my issue is yet another check for network

57 2020-04-15 11:35:33 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> update-commit [ArtifactCommit_Enter]
58 2020-04-15 11:35:33 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
59 2020-04-15 11:35:33 +0000 UTC debug: start executing script: ArtifactReboot_Leave_00_Wait_For_Network
60 2020-04-15 11:35:33 +0000 UTC error: failed to report status: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/00678764-45df-428b-8434-534ca420156e/status: dial tcp: lookup h osted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable

From my perspective (end-user), this should be handled by mender client.

Current script:

#!/bin/sh

RETRY_LATER=21
OK=0

ping 8.8.8.8 -c1 > /dev/null 2>&1
if [ ? != 0 ]; then exit {RETRY_LATER}
fi

TIME_SYNC_STATUS=$(timedatectl | grep NTP | awk '{print NF}') if [ "{TIME_SYNC_STATUS}" != “yes” ] && [ "{TIME_SYNC_STATUS}" != "active" ]; then exit {RETRY_LATER}
fi

exit ${OK}

I guess nslookup is needed here also.

Yes, I can see that would help with this situation. Any chance you feel like testing out a fix and submitting a PR?

Drew

Completely swamped with core work on my business, but I’d gladly test the feature if someone commited it!

Hi @mkozjak,

I wanted to take a step back here and analyze your first log, and specifically this section is interesting to me,

2020-04-15 11:35:33 +0000 UTC debug: handle update commit state
2020-04-15 11:35:33 +0000 UTC error: failed to report status: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/de2f91d2-0574-4165-8a4b-616809d23370/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-15 11:35:33 +0000 UTC error: error reporting update status: reporting status failed: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/de2f91d2-0574-4165-8a4b-616809d23370/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-15 11:35:33 +0000 UTC error: Failed to send status report to server: transient error: reporting status failed: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/de2f91d2-0574-4165-8a4b-616809d23370/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-15 11:35:33 +0000 UTC info: State transition: update-commit [ArtifactCommit_Enter] -> update-pre-commit-status-report-retry [ArtifactCommit_Enter]
2020-04-15 14:21:07 +0000 UTC info: Running Mender version 2.2.0
2020-04-15 14:21:07 +0000 UTC error: Update was interrupted in state: update-commit

There is more going on here then the Mender client failing because of network. First of all the Mender client will not give up after one connection try, and it normally would keep retrying here. How long it will wait for to be able to connect to the Mender server depends largely on what RetryPollIntervalSeconds is set to.

To me it looks like something else is rebooting the device here outside of Mender client knowledge,

'2020-04-15 11:35:33 +0000 UTC error: Failed to send status report to server: transient error: reporting status failed: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/de2f91d2-0574-4165-8a4b-616809d23370/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-15 11:35:33 +0000 UTC info: State transition: update-commit [ArtifactCommit_Enter] -> update-pre-commit-status-report-retry [ArtifactCommit_Enter]
2020-04-15 14:21:07 +0000 UTC info: Running Mender version 2.2.0

This is also indicated that the Mender client reports the following:

2020-04-15 14:21:07 +0000 UTC error: Update was interrupted in state: update-commit

So would probably start with investigating this before trying anything else.

Regarding the network check, I do not agree that it should be handled by the Mender client. The Mender client already has a grace period where it will try to connect to the Mender server and if it timeouts it will mark the update as failed. This normally covers it.

Further I think it is best to handle service dependencies in your init system, e.g if you want to delay a service start until internet is available. I would recommend you to read this which contains some good information about the complexity of this and how this can be partly solved in systemd. Though I do not think this should be needed for it to work, and could be used as an optimization in some cases.

1 Like

@mirzak, thank you for the explanation. I decided to go with a clean new image and test artifact updates on it. Just to make things clear - we should include non Artifact* files via rootfs_overlay (mender-convert) and Artifact* files by including path to those files (for example artifact_scripts/) in configs/mender_convert_config. Are there any cons in including Artifact* files in both configs/mender_convert_config and rootfs_overlay?

@mirzak https://gist.github.com/mkozjak/b00d9850c54f2cceaaf52fa376b33c2c
This is the entire log of the latest update after placing all the files in appropriate locations. It’s somewhat different from the one that you observed.

Entire syslog: https://gist.github.com/77674f76f97143cead551a3e317e6318
11:00:00 +0000 UTC is the time when the device gets rebooted after it writes to a passive partition, p3.

The first error ever happened is

Apr 16 12:00:19 raspberrypi mender[295]: time=“2020-04-16T12:00:19+01:00” level=error msg=“Update was interrupted in state: after-reboot” module=state

which relates to

2020-04-16 07:07:49 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1
2020-04-16 07:07:49 +0000 UTC debug: List of U-Boot variables:map[mender_check_saveenv_canary:1]
2020-04-16 07:07:49 +0000 UTC debug: Have U-Boot variable: mender_saveenv_canary=1
2020-04-16 07:07:49 +0000 UTC debug: List of U-Boot variables:map[mender_saveenv_canary:1]
2020-04-16 07:07:49 +0000 UTC debug: Have U-Boot variable: upgrade_available=1
2020-04-16 07:07:49 +0000 UTC debug: List of U-Boot variables:map[upgrade_available:1]
2020-04-16 07:07:49 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> after-reboot [ArtifactReboot_Leave]
2020-04-16 07:07:49 +0000 UTC debug: handling state after reboot
2020-04-16 07:07:49 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> update-commit [ArtifactCommit_Enter]
2020-04-16 07:07:49 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-04-16 07:07:49 +0000 UTC debug: start executing script: ArtifactReboot_Leave_00_Wait_For_Network
2020-04-16 07:07:49 +0000 UTC error: failed to report status: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/d16df6e6-8dfa-4193-a933-f1f2151d391e/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-16 07:07:49 +0000 UTC error: statescript: can not send start status to server: reporting status failed: Put https://hosted.mender.io/api/devices/v1/deployments/device/deployments/d16df6e6-8dfa-4193-a933-f1f2151d391e/status: dial tcp: lookup hosted.mender.io on 213.191.128.8:53: dial udp 213.191.128.8:53: connect: network is unreachable
2020-04-16 07:07:49 +0000 UTC warning: No total time set for the retry-scripts’ timeslot. Falling back to default: 30m0s
2020-04-16 07:07:49 +0000 UTC info: statescript: ArtifactReboot_Leave_00_Wait_For_Network requested a retry
2020-04-16 07:07:49 +0000 UTC warning: No timeout interval set for the retry-scripts. Falling back to default: 1m0s
2020-04-16 11:00:19 +0000 UTC info: Running Mender version 2.2.0
2020-04-16 11:00:19 +0000 UTC error: Update was interrupted in state: after-reboot
2020-04-16 11:00:19 +0000 UTC info: State transition: init [none] -> rollback [ArtifactRollback]
2020-04-16 11:00:19 +0000 UTC debug: transitioning to error state
2020-04-16 11:00:19 +0000 UTC debug: statescript: timeout for executing scripts is not defined; using default of 1h0m0s seconds
2020-04-16 11:00:19 +0000 UTC info: performing rollback
2020-04-16 11:00:19 +0000 UTC debug: Have U-Boot variable: mender_check_saveenv_canary=1

Might be my checking script isn’t right?

#!/bin/sh

RETRY_LATER=21
OK=0

ping 8.8.8.8 -c1 > /dev/null 2>&1
if [ ? != 0 ]; then exit {RETRY_LATER}
fi

nslookup hosted.mender.io > /dev/null 2>&1
if [ ? != 0 ]; then exit {RETRY_LATER}
fi

TIME_SYNC_STATUS=$(timedatectl | grep NTP | awk '{print NF}') if [ "{TIME_SYNC_STATUS}" != “yes” ] && [ "{TIME_SYNC_STATUS}" != "active" ]; then exit {RETRY_LATER}
fi

exit ${OK}

My RetryPollIntervalSeconds is set to 30. I use a hosted mender server (hosted.mender.io).
If nothing won’t help, I guess the only real way is to lock systemd tight so all mender services can’t run without proper connectivity.

I should point out that I’m using the latest mender-convert 2.0.x commit.

Sorry if I’m bombing the thread, I can open up a new one if you demand it.

Still the same problem,

# This is the last entry from the Mender client
2020-04-16 07:07:49 +0000 UTC warning: No timeout interval set for the retry-scripts. Falling back to default: 1m0s

# The time difference between above and below entries indicates to me that the device was rebooted and it was not the Mender client that initiated it.

2020-04-16 11:00:19 +0000 UTC info: Running Mender version 2.2.0
2020-04-16 11:00:19 +0000 UTC error: Update was interrupted in state: after-reboot

And I do not think that this has anything to do with network access as the Mender client never gets a chance to timeout and restart, something beats it to it.

So do you have something that reboots the device on first boot after an update?

1 Like

I do have a disk resizing script that I install via mender-convert --config.

This one:

# Real life SD cards typically have less than they advertise. First off, they
# often use a base of 1000 instead of 1024, and even then they are often smaller
# than advertised. The number below is based on a conservative target of 7.9GB
# (that's mathematical GB, base 1000), converted to MiB, rounding down.
MENDER_STORAGE_TOTAL_SIZE_MB=7534

# Use all there is, which gets us almost, but not quite, to 500MiB free space
# (about 480MiB at the time of writing).
IMAGE_ROOTFS_SIZE=-1
IMAGE_ROOTFS_EXTRA_SPACE=0
IMAGE_OVERHEAD_FACTOR=1.0

# Best compression there is!
# MENDER_COMPRESS_DISK_IMAGE=lzma

#
# Resize the data partition to fill the remaining space, using parted, with systemd
# Grow ext fs afterwards
#
function grow_data_partition_and_filesystem() {
  log_info "Adding systemd init script to run parted and resize the data partition on boot"
  log_info "to fill all the available space on the storage media (for systemd below version 242)"
  run_and_log_cmd "sudo mkdir -p work/rpi/etc/systemd/system/"
  run_and_log_cmd "sudo mkdir -p work/rootfs/etc/systemd/system/data.mount.wants/"
  cat <<-EOF > work/rpi/etc/systemd/system/mender-grow-data.service
  [Unit]
  Description=Mender service to grow data partition and filesystem size
  DefaultDependencies=no
  Before=data.mount
  Before=systemd-growfs@data.service

  [Service]
  Type=oneshot
  User=root
  Group=root
  ExecStart=/sbin/parted --script /dev/mmcblk0 resizepart ${MENDER_DATA_PART_NUMBER} 100%
  ExecStart=/sbin/resize2fs /dev/mmcblk0p${MENDER_DATA_PART_NUMBER}

  [Install]
  WantedBy=data.mount
EOF

  # Install
  run_and_log_cmd "cp work/rpi/etc/systemd/system/mender-grow-data.service \
                      work/rootfs/etc/systemd/system/"
  run_and_log_cmd "ln -sf work/rootfs/etc/systemd/system/mender-grow-data.service \
                          work/rootfs/etc/systemd/system/data.mount.wants/"
}

PLATFORM_MODIFY_HOOKS+=(grow_data_partition_and_filesystem)

This might mean parted reboots the thing! My systemd version is 241 and I need /data to grow on initial boot.

It should not.

Looked at your syslog, it says,

Apr 16 12:00:14 raspberrypi rsyslogd: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd.  [v8.1901.0]
Apr 16 12:00:14 raspberrypi rsyslogd:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="297" x-info="https://www.rsyslog.com"] start
Apr 16 12:00:14 raspberrypi cron[343]: (CRON) INFO (Running @reboot jobs)

@mirzak, yeah, it’s not parted. Tried removing it just in case. Looking into cron @reboot jobs. There’s no cron entries on any users, though…

@mirzak, just a short update. You were right about the reboots. I have finally decided to ditch raspbian and go with a yocto build from scratch. Updates work as expected. Thank you so much for your support!

2 Likes