Deployment failure caused by broken time synchronization and DNSSEC failure

System:

  • Raspberry Pi CM3+, Raspbian, kernel 4.19.75-v7l+
  • Mender client: 2.5.0 runtime: go1.14.7
  • NetworkManager

Summary:

  • Non-working time synchronization may cause trouble when deploying an artifact
  • A wrong system time may cause DNSSEC lookup errors
  • DNSSEC lookup errors make Mender unable to “phone home” to servers specified by host name (hosted.mender.io), because the DNS lookup fails

This might be a problem on some distributions when the date/time is way off on first boot. Ref. “Can't sync time when time is incorrect due to dnssec”, Issue #5873 in systemd/systemd on GitHub, closed as late as 2020.06.23.

I’ve done several deployments on devices running Ubuntu without any issues (not caused by myself…). This is the first Raspbian deployment I’ve done, and currently there are some issues I haven’t seen before.

I have a Mender Artifact deployment that fails because the DNS lookup fails while the update status is being reported. The deployment log looks like this:

2021-03-05 11:22:08 +0000 UTC info: State transition: init [none] -> after-reboot [ArtifactReboot_Leave]
2021-03-05 11:22:08 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> after-reboot [ArtifactReboot_Leave]
2021-03-05 11:22:08 +0000 UTC info: State transition: after-reboot [ArtifactReboot_Leave] -> update-commit [ArtifactCommit_Enter]
2021-03-05 11:22:09 +0000 UTC error: Failed to report status: Put "https://hosted.mender.io/api/devices/v1/deployments/device/deployments/xxxx/status": dial tcp: lookup hosted.mender.io: no such host
2021-03-05 11:22:09 +0000 UTC error: error reporting update status: reporting status failed: Put "https://hosted.mender.io/api/devices/v1/deployments/device/deployments/xxxx/status": dial tcp: lookup hosted.mender.io: no such host
2021-03-05 11:22:09 +0000 UTC error: Failed to send status report to server: transient error: reporting status failed: Put "https://hosted.mender.io/api/devices/v1/deployments/device/deployments/xxxx/status": dial tcp: lookup hosted.mender.io: no such host

The artifact was downloaded correctly, and I was able to manually boot the new root by setting mender_boot_part and mender_boot_part_hex. The problem was apparently DNS lookup, and I was not able to ping any known hosts. This led me to look at the systemd-resolved log using journalctl, which contained a lot of DNSSEC failures:

Mar 07 03:01:48 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question mender.io IN DS: no-signature
Mar 07 03:01:48 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question hosted.mender.io IN DS: no-signature
Mar 07 03:01:48 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question hosted.mender.io IN SOA: no-signature
Mar 07 03:01:48 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question hosted.mender.io IN A: no-signature
Mar 07 03:01:51 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question ntp.org IN DS: no-signature
Mar 07 03:01:51 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question pool.ntp.org IN DS: no-signature
Mar 07 03:01:51 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question 2.debian.pool.ntp.org IN SOA: no-signature
Mar 07 03:01:51 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question 2.debian.pool.ntp.org IN A: no-signature
Mar 07 03:01:51 somecontroller systemd-resolved[22188]: DNSSEC validation failed for question 2.debian.pool.ntp.org IN DS: no-signature
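With many such lines in the journal, it can help to condense them into one line per failing question. The following is only a rough sketch that assumes the log format shown above; the function name summarize_dnssec_failures is made up for illustration.

```shell
# Summarize systemd-resolved DNSSEC failures read from stdin.
# Prints "<count> <name> <record type>" for each distinct failing question.
# Assumes the message format shown in the log excerpt above.
summarize_dnssec_failures() {
    sed -n 's/.*DNSSEC validation failed for question \([^ ]*\) IN \([A-Z0-9]*\):.*/\1 \2/p' |
        sort | uniq -c | sed 's/^ *//'
}

# Typical use:
#   journalctl -u systemd-resolved --no-pager | summarize_dnssec_failures
```

If both the Mender server name and the NTP pool names show up in the summary, that hints that the problem is not specific to one domain.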

At this point, I realized that the controller time was way off (several hours). Setting the time/date manually fixed the DNS lookups that had failed because of DNSSEC.

To reproduce the error, set the time “wrong” and flush the systemd-resolved cache:

$ sudo date -s 01:31
$ sudo systemd-resolve --flush-caches
$ ping vg.no
ping: vg.no: Name or service not known

In my case, this should not be possible, and the problem was actually a conflict between ntpd and systemd-timesyncd. There was no need for ntpd, so I removed it with sudo apt purge ntp. As long as time sync is up and running, DNSSEC should also work correctly.

Edit:
I still don’t have a good solution to this problem with systemd-timesyncd. It seems that whenever a new Mender artifact is downloaded, the time is set to the artifact-creation (or image-creation) date. Currently, I see the following workarounds:

  • Disable DNSSEC
  • Add an NTP server address to /etc/hosts
  • Add some sort of script to set the time at boot, based on known IP addresses, an RTC or something similar
  • Add fallback NTP servers (by IP address) to /etc/systemd/timesyncd.conf

Hi @johan, we definitely require a valid time source for certificate validation and such. If your system does not have an RTC (which I believe is true of all Raspberry Pi based boards), then your time will not necessarily be correct on first boot and you will need to wait for the time to sync. We do have a recipe in Yocto which adds a state script to wait for the time to sync before attempting to connect to the server. Perhaps something similar would work for you here.

Drew
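For a non-Yocto image, the wait-for-time-sync idea could be sketched roughly as below. This is not the script from the Yocto recipe. The function names, the retry count and the delay are assumptions for illustration, and which state the script hooks into (and the file name it needs) is left to the Mender state-script documentation. It relies on timedatectl show, which should be available on systemd 239 and later.

```shell
#!/bin/sh
# Succeeds when systemd reports the system clock as NTP-synchronized.
is_time_synced() {
    [ "$(timedatectl show -p NTPSynchronized --value 2>/dev/null)" = "yes" ]
}

# wait_for_timesync MAX_TRIES [DELAY_SECONDS] [CHECK_COMMAND]
# Polls CHECK_COMMAND until it succeeds or MAX_TRIES attempts have been made.
# Returns 0 once the clock is reported as synced, 1 on timeout.
# CHECK_COMMAND defaults to is_time_synced; it is a parameter so the loop
# can be tried out without a real time source.
wait_for_timesync() {
    max_tries="$1"
    delay="${2:-5}"
    check="${3:-is_time_synced}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if "$check"; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}

# Example for the end of a state script (60 tries, 5 seconds apart):
#   wait_for_timesync 60 5 || exit 1
```

Whether a timeout should fail the script or let the client carry on is a design choice; the example above fails, on the assumption that contacting the server with a wrong clock would not succeed anyway.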

Thanks for the response. I guess this is actually not a pure Mender issue, but more a chicken-and-egg problem in systemd-timesyncd, as described in issue #5873: DNSSEC requires a correct time, and time synchronization requires a DNS lookup using DNSSEC…

Currently, my working solution is to add a bunch of valid NTP IP addresses as fallback in /etc/systemd/timesyncd.conf. This synchronizes the clock at boot without DNS (and DNSSEC), and I don’t have to disable DNSSEC in resolved.

I’m a bit surprised that this hasn’t been reported by other users on hardware without an RTC. After all, this was fixed quite recently in systemd.

Adding the following to /etc/systemd/timesyncd.conf is my current working solution:

FallbackNTP=216.239.35.0 213.14.68.38 81.175.5.182 176.235.41.255 50.205.244.108
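The FallbackNTP= key belongs in the [Time] section, and the workaround only avoids DNS if every entry is an IP literal. As a hedged sketch, the same setting could also be generated as a drop-in under timesyncd.conf.d instead of editing the main file. The function names and the drop-in file name 10-fallback-ntp.conf are made up for illustration.

```shell
# Succeeds if $1 looks like a dotted-quad IPv4 literal (so no DNS is needed).
# This is a shape check only; it does not verify that each octet is <= 255.
is_ipv4_literal() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# write_fallback_ntp_dropin CONF_ROOT ADDRESS...
# Writes a timesyncd drop-in listing the addresses as FallbackNTP.
# CONF_ROOT is normally /etc/systemd. Host names are rejected, because they
# would bring back the DNS dependency this workaround is meant to avoid.
write_fallback_ntp_dropin() {
    conf_root="$1"
    shift
    [ "$#" -gt 0 ] || return 1
    for addr in "$@"; do
        is_ipv4_literal "$addr" || {
            echo "not an IPv4 literal: $addr" >&2
            return 1
        }
    done
    mkdir -p "$conf_root/timesyncd.conf.d" || return 1
    # "$*" joins the addresses with single spaces.
    printf '[Time]\nFallbackNTP=%s\n' "$*" \
        > "$conf_root/timesyncd.conf.d/10-fallback-ntp.conf"
}

# Example (as root), followed by a restart of systemd-timesyncd:
#   write_fallback_ntp_dropin /etc/systemd 216.239.35.0 213.14.68.38
```

Hard-coded addresses can go stale, so the list probably needs to be reviewed from time to time.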

looks like it’s fixed in upcoming systemd 248

you could try adding the following to your systemd-timesyncd.service as per the fix

Environment=SYSTEMD_NSS_RESOLVE_VALIDATE=0

Yes, good point, but my systemd is too old. It would be a workaround in any case, and I would prefer to keep workarounds as configuration (and not special builds, service modifications etc.). I would have to install systemd >= v247-rc2, because the systemd running on my distro is v241.

yes, sorry, i wasn’t very clear. the extra systemd unit key/value would not be added by changing the stock systemd files, but by dropping in your own configuration file using the systemd ‘overrides’ (drop-in) mechanism for overriding/extending stock configuration. this leaves the original stock systemd files untouched, and systemd merges them at runtime. most build tools have mechanisms for easily adding these files. mender-convert, i believe, uses “overlays” to achieve this.

see link below

https://www.freedesktop.org/software/systemd/man/systemd.unit.html
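As a rough sketch of that drop-in mechanism, the environment setting could be placed in a systemd-timesyncd.service.d directory next to the stock unit. The function name is made up for illustration, and whether the variable has any effect depends on the systemd version in use.

```shell
# Create a drop-in for systemd-timesyncd.service without touching the stock unit.
# $1 is the unit configuration root; it defaults to /etc/systemd/system and is
# a parameter only so the function can be tried out in a scratch directory
# (or pointed at an image overlay tree).
write_timesyncd_service_dropin() {
    unit_root="${1:-/etc/systemd/system}"
    dropin_dir="$unit_root/systemd-timesyncd.service.d"
    mkdir -p "$dropin_dir" || return 1
    # "override.conf" is the name "systemctl edit" uses; any *.conf name works.
    printf '[Service]\nEnvironment=SYSTEMD_NSS_RESOLVE_VALIDATE=0\n' \
        > "$dropin_dir/override.conf"
}

# Example on a running system (as root):
#   write_timesyncd_service_dropin
#   systemctl daemon-reload
#   systemctl restart systemd-timesyncd
```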

However, this doesn’t seem to help: a second look at the systemd code suggests that SYSTEMD_NSS_RESOLVE_VALIDATE didn’t come in until v248 either, in a different commit.