Rollback due to kernel not being able to load modules

Hello all. We’re using a mender 2.0 client and a 2.4b1 server. We’re having an issue where the client downloads an artifact and reboots to finish the install. When the client reboots, for some reason the kernel is unable to load any modules including the network and our device rolls back to the previous version which works. The issue where the client rolls back due to not being able to load kernel modules happens infrequently maybe 2% of the time. The issue can be fixed by updating the device to a different version and then updating again to the version that previously failed.

We are getting the following messages from our kernel modules in dmesg:
bcmdhd: disagrees about version of symbol module_layout
snd_soc_cx2072x: disagrees about version of symbol module_layout
wlan: disagrees about version of symbol module_layout
snd_soc_simple_card_utils: disagrees about version of symbol module_layout

I did some research and found that the kernel zImage is the same between a working device and one in the state after the reboot but before the rollback. I also checked the entire /lib/modules tree and all of the modules have the same md5sum as they do on a device in the working state.

I looked at the vermagic number for the network driver and it has the same kernel version as does uname -r for the system.

modinfo wlan |grep vermagic
vermagic: 4.9.88-tn-imx_4.9.88_2.0.0_ga-test+gd69606b85de5 SMP preempt mod_unload modversions ARMv7 p2v8

uname -r
4.9.88-tn-imx_4.9.88_2.0.0_ga-test+gd69606b85de5

I’m at a loss to understand how two devices with the same kernel and modules can have the update work on one device and fail on another. Has anybody else seen the problem where the kernel reboots and can’t load modules? Does anybody have any idea why this is happening or have suggestions on what else to check?

Regards,
Cameron

Indeed very strange error. Can not say that I have come across anything similar before.

The issue can be fixed by updating the device to a different version and then updating again to the version that previously failed.

It sounds strange to me that this would work. This implies that there is some “state” on the device that is not cleared by the first update, but is cleared by the second? Or that something is corrupted first time around.

Just to get a couple of things out of the way:

  1. Are you deploying rootfs images updates in all the mentioned scenarions?
  2. Is the Linux kernel part of the rootfs (e.g in /boot)?

My prime suspect, based on the information you provide and if the answer is yes to both of the above questions, is that it is hardware related (write/read errors). At least I would try to verify if this is they case.

The Mender client will only verify the integrity of the transferred payload, but it is up to the OS to verify the integrity of what is written, so there is a possible for this to happen due to buggy hardware of hardware drivers.

@mirzak

Hi Mirza,
You are correct we are deploying full rootfs updates and the linux kernel is part of the update. I hadn’t thought to consider hardware error as a potential problem. It seems weird that the hardware would have an identical error on multiple devices that led to kernel modules not getting loaded. I already checked md5sums of the kernel and modules and didn’t find differences between a working device and the booted partition without working kernel modules. Something is obviously different in that the modules won’t load but I haven’t figured out what it is that’s different.

Just one more question, is your rootfs configured read-only or is it writable? If it would be read-only you could quite easily determinate if something is going wrong during the write operations as you can compare the checksum in the Artifact to the actual partition checksum.

It seems weird that the hardware would have an identical error on multiple devices that led to kernel modules not getting loaded

I can agree here, a hardware error probably would show different type of errors. But there is a chance that you are also hitting an error/bug that is reproducible giving a consistent behavior. I am grasping here, so this can be totally wrong :slight_smile:.

But the fact that it only happens 2 % of time, is what I am most suspicion to.

I already checked md5sums of the kernel and modules and didn’t find differences between a working device and the booted partition without working kernel modules.

I am not sure to exactly what role the modules.* files could play in this situation, but I would also check if there are any changes in them.

If you can share a full dmesg log, it would be interesting to look trough :slight_smile:

Here’s the dmesg log from the device that failed. I saw that it was going to fail and I killed the mender daemon so that it wouldn’t rollback before I finished looking at the data on the device. I’ve since updated it to a different version and then the version I was trying to get to and it worked fine one again.

Booting Linux on physical CPU 0x0
Linux version 4.9.88-tn-imx_4.9.88_2.0.0_ga-test+gd69606b85de5 (oe-user@oe-host) (gcc version 7.3.0 (GCC) ) #1 SMP PREEMPT Fri Apr 17 14:10:48 PDT 2020
CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c53c7d
CPU: div instructions available: patching division code
CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
OF: fdt:Machine model: TechNexion PICO-IMX6ULL with QCA WLAN module and PI baseboard
Memory policy: Data cache writealloc
On node 0 totalpages: 65536
free_area_init_node: node 0, pgdat 80e6f6c0, node_mem_map 8fdf7000
Normal zone: 512 pages used for memmap
Normal zone: 0 pages reserved
Normal zone: 65536 pages, LIFO batch:15
percpu: Embedded 14 pages/cpu @8fdd1000 s26828 r8192 d22324 u57344
pcpu-alloc: s26828 r8192 d22324 u57344 alloc=14*4096
pcpu-alloc: [0] 0
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 65024
Kernel command line: root=/dev/mmcblk0p2 console=ttymxc5,115200
PID hash table entries: 1024 (order: 0, 4096 bytes)
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Memory: 245444K/262144K available (8192K kernel code, 449K rwdata, 3012K rodata, 1024K init, 452K bss, 16700K reserved, 0K cma-reserved, 0K highmem)
Virtual kernel memory layout:
vector : 0xffff0000 - 0xffff1000 ( 4 kB)
fixmap : 0xffc00000 - 0xfff00000 (3072 kB)
vmalloc : 0x90800000 - 0xff800000 (1776 MB)
lowmem : 0x80000000 - 0x90000000 ( 256 MB)
pkmap : 0x7fe00000 - 0x80000000 ( 2 MB)
modules : 0x7f000000 - 0x7fe00000 ( 14 MB)
.text : 0x80008000 - 0x80900000 (9184 kB)
.init : 0x80d00000 - 0x80e00000 (1024 kB)
.data : 0x80e00000 - 0x80e707e0 ( 450 kB)
.bss : 0x80e72000 - 0x80ee3090 ( 453 kB)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
Preemptible hierarchical RCU implementation.
Build-time adjustment of leaf fanout to 32.
RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=1.
RCU: Adjusting geometry for rcu_fanout_leaf=32, nr_cpu_ids=1
NR_IRQS:16 nr_irqs:16 16
Switching to timer-based delay loop, resolution 333ns
sched_clock: 32 bits at 3000kHz, resolution 333ns, wraps every 715827882841ns
clocksource: mxc_timer1: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 637086815595 ns
Console: colour dummy device 80x30
Calibrating delay loop (skipped), value calculated using timer frequency… 6.00 BogoMIPS (lpj=15000)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
CPU: Testing write buffer coherency: ok
/cpus/cpu@0 missing clock-frequency property
CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
Setting up static identity map for 0x80100000 - 0x80100058
Brought up 1 CPUs
SMP: Total of 1 processors activated (6.00 BogoMIPS).
CPU: All CPU(s) started in SVC mode.
devtmpfs: initialized
VFP support v0.3: implementor 41 architecture 2 part 30 variant 7 rev 5
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 9556302231375000 ns
futex hash table entries: 256 (order: 2, 16384 bytes)
pinctrl core: initialized pinctrl subsystem
NET: Registered protocol family 16
DMA: preallocated 256 KiB pool for atomic coherent allocations
cpuidle: using governor menu
hw-breakpoint: found 5 (+1 reserved) breakpoint and 4 watchpoint registers.
hw-breakpoint: maximum watchpoint size is 8 bytes.
imx6ul-pinctrl 20e0000.iomuxc: initialized IMX pinctrl driver
imx6ul-pinctrl 2290000.iomuxc-snvs: initialized IMX pinctrl driver
mxs-dma 1804000.dma-apbh: initialized
vgaarb: loaded
SCSI subsystem initialized
libata version 3.00 loaded.
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
i2c i2c-0: IMX I2C adapter registered
i2c i2c-0: can’t use DMA, using PIO instead.
i2c i2c-1: IMX I2C adapter registered
i2c i2c-1: can’t use DMA, using PIO instead.
i2c i2c-2: IMX I2C adapter registered
i2c i2c-2: can’t use DMA, using PIO instead.
Linux video capture interface: v2.00
pps_core: LinuxPPS API ver. 1 registered
pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti giometti@linux.it
PTP clock support registered
MIPI CSI2 driver module loaded
imx rpmsg driver is registered.
Advanced Linux Sound Architecture Driver Initialized.
Bluetooth: Core ver 2.22
NET: Registered protocol family 31
Bluetooth: HCI device and connection manager initialized
Bluetooth: HCI socket layer initialized
Bluetooth: L2CAP socket layer initialized
Bluetooth: SCO socket layer initialized
clocksource: Switched to clocksource mxc_timer1
VFS: Disk quotas dquot_6.6.0
VFS: Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
NET: Registered protocol family 2
TCP established hash table entries: 2048 (order: 1, 8192 bytes)
TCP bind hash table entries: 2048 (order: 2, 16384 bytes)
TCP: Hash tables configured (established 2048 bind 2048)
UDP hash table entries: 256 (order: 1, 8192 bytes)
UDP-Lite hash table entries: 256 (order: 1, 8192 bytes)
NET: Registered protocol family 1
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
PCI: CLS 0 bytes, default 64
Bus freq driver module loaded
workingset: timestamp_bits=30 max_order=16 bucket_order=0
NFS: Registering the id_resolver key type
Key type id_resolver registered
Key type id_legacy registered
io scheduler noop registered
io scheduler deadline registered
io scheduler cfq registered (default)
backlight supply power not found, using dummy regulator
imx-sdma 20ec000.sdma: loaded firmware 3.3
pfuze100-regulator 0-0008: Full layer: 1, Metal layer: 1
pfuze100-regulator 0-0008: FAB: 0, FIN: 0
pfuze100-regulator 0-0008: pfuze3000 found.
21ec000.serial: ttymxc2 at MMIO 0x21ec000 (irq = 65, base_baud = 5000000) is a IMX
21f4000.serial: ttymxc4 at MMIO 0x21f4000 (irq = 66, base_baud = 5000000) is a IMX
21fc000.serial: ttymxc5 at MMIO 0x21fc000 (irq = 67, base_baud = 5000000) is a IMX
random: fast init done
console [ttymxc5] enabled
imx sema4 driver is registered.
[drm] Initialized
[drm] Initialized vivante 1.0.0 20120216 on minor 0
brd: module loaded
loop: module loaded
spi_imx 2010000.ecspi: probed
libphy: Fixed MDIO Bus: probed
CAN device driver interface
2090000.can supply xceiver not found, using dummy regulator
flexcan 2090000.can: device registered (reg_base=909ec000, irq=25)
2094000.can supply xceiver not found, using dummy regulator
flexcan 2094000.can: device registered (reg_base=909f4000, irq=26)
20b4000.ethernet supply phy not found, using dummy regulator
pps pps0: new PPS source ptp0
libphy: fec_enet_mii_bus: probed
fec 20b4000.ethernet eth0: registered PHC device 0
usbcore: registered new interface driver kaweth
usbcore: registered new interface driver asix
usbcore: registered new interface driver ax88179_178a
usbcore: registered new interface driver cdc_ether
ehci_hcd: USB 2.0 ‘Enhanced’ Host Controller (EHCI) Driver
ehci-pci: EHCI PCI platform driver
ehci-mxc: Freescale On-Chip EHCI Host driver
usbcore: registered new interface driver usb-storage
usbcore: registered new interface driver usbserial
usbcore: registered new interface driver usbserial_generic
usbserial: USB Serial support registered for generic
usbcore: registered new interface driver ftdi_sio
usbserial: USB Serial support registered for FTDI USB Serial Device
usbcore: registered new interface driver option
usbserial: USB Serial support registered for GSM modem (1-port)
usbcore: registered new interface driver pl2303
usbserial: USB Serial support registered for pl2303
2184800.usbmisc supply vbus-wakeup not found, using dummy regulator
2184200.usb supply vbus not found, using dummy regulator
ci_hdrc ci_hdrc.1: EHCI Host Controller
ci_hdrc ci_hdrc.1: new USB bus registered, assigned bus number 1
ci_hdrc ci_hdrc.1: USB 2.0 started, EHCI 1.00
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 1 port detected
mousedev: PS/2 mouse device common for all mice
input: 20cc000.snvs:snvs-powerkey as /devices/soc0/soc/2000000.aips-bus/20cc000.snvs/20cc000.snvs:snvs-powerkey/input/input0
snvs_rtc 20cc000.snvs:snvs-rtc-lp: rtc core: registered 20cc000.snvs:snvs-r as rtc0
i2c /dev entries driver
usbcore: registered new interface driver uvcvideo
USB Video Class driver (1.1.1)
imx2-wdt 20bc000.wdog: timeout 60 sec (nowayout=0)
Bluetooth: HCI UART driver ver 2.3
Bluetooth: HCI UART protocol H4 registered
Bluetooth: HCI UART protocol BCSP registered
Bluetooth: HCI UART protocol ATH3K registered
usbcore: registered new interface driver bcm203x
usbcore: registered new interface driver btusb
usbcore: registered new interface driver ath3k
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright© Pierre Ossman
sdhci-pltfm: SDHCI platform and OF driver helper
mmc0: SDHCI controller on 2190000.usdhc [2190000.usdhc] using ADMA
mmc1: SDHCI controller on 2194000.usdhc [2194000.usdhc] using ADMA
usbcore: registered new interface driver usbhid
mmc0: new DDR MMC card at address 0001
usbhid: USB HID core driver
mmcblk0: mmc0:0001 M62704 3.56 GiB
mmc1: queuing unknown CIS tuple 0x01 (3 bytes)
mmcblk0boot0: mmc0:0001 M62704 partition 1 2.00 MiB
mmc1: queuing unknown CIS tuple 0x1a (5 bytes)
imx-sgtl5000: probe of sound failed with error -2
NET: Registered protocol family 26
mmcblk0boot1: mmc0:0001 M62704 partition 2 2.00 MiB
mmcblk0rpmb: mmc0:0001 M62704 partition 3 512 KiB
nf_conntrack version 0.5.0 (4096 buckets, 16384 max)
mmc1: queuing unknown CIS tuple 0x1b (8 bytes)
ip_tables: © 2000-2006 Netfilter Core Team
mmc1: queuing unknown CIS tuple 0x14 (0 bytes)
NET: Registered protocol family 17
bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
can: controller area network core (rev 20120528 abi 9)
NET: Registered protocol family 29
can: raw protocol (rev 20120528)
can: broadcast manager protocol (rev 20161123 t)
can: netlink gateway (rev 20130117) max_hops=1
Bluetooth: RFCOMM TTY layer initialized
Bluetooth: RFCOMM socket layer initialized
Bluetooth: RFCOMM ver 1.11
Bluetooth: BNEP (Ethernet Emulation) ver 1.3
Bluetooth: BNEP filters: protocol multicast
Bluetooth: BNEP socket layer initialized
Bluetooth: HIDP (Human Interface Emulation) ver 1.2
Bluetooth: HIDP socket layer initialized
lib80211: common routines for IEEE802.11 drivers
lib80211_crypt: registered algorithm ‘NULL’
lib80211_crypt: registered algorithm ‘WEP’
lib80211_crypt: registered algorithm ‘CCMP’
lib80211_crypt: registered algorithm ‘TKIP’
rfkill_gpio bt_rfkill: bt_rfkill device registered.
Key type dns_resolver registered
mmc1: queuing unknown CIS tuple 0x80 (1 bytes)
mmc1: queuing unknown CIS tuple 0x81 (1 bytes)
mmc1: queuing unknown CIS tuple 0x82 (1 bytes)
mmc1: new high speed SDIO card at address 0001
cpu cpu0: Registered imx6q-cpufreq
imx_thermal 2000000.aips-bus:tempmon: Commercial CPU temperature grade - max:95C critical:90C passive:85C
mmcblk0: p1 p2 p3 p4
snvs_rtc 20cc000.snvs:snvs-rtc-lp: setting system clock to 2020-07-30 01:33:30 UTC (1596072810)
usb_otg_vbus: disabling
ALSA device list:
No soundcards found.
EXT4-fs (mmcblk0p2): couldn’t mount as ext3 due to feature incompatibilities
EXT4-fs (mmcblk0p2): couldn’t mount as ext2 due to feature incompatibilities
EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
devtmpfs: mounted
Freeing unused kernel memory: 1024K
systemd[1]: systemd 234 running in system mode. (-PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP -GCRYPT -GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID -ELFUTILS +KMOD -IDN2 -IDN default-hierarchy=hybrid)
systemd[1]: Detected architecture arm.
systemd[1]: Set hostname to .
systemd[1]: Initializing machine ID from random generator.
systemd[1]: Installed transient /etc/machine-id file.
systemd[1]: Listening on /dev/initctl Compatibility Named Pipe.
systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
systemd[1]: Started Forward Password Requests to Wall Directory Watch.
systemd[1]: Listening on udev Kernel Socket.
systemd[1]: Created slice User and Session Slice.
EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
systemd-journald[123]: Received request to flush runtime journal from PID 1
bcmdhd: disagrees about version of symbol module_layout
snd_soc_cx2072x: disagrees about version of symbol module_layout
wlan: disagrees about version of symbol module_layout
snd_soc_simple_card_utils: disagrees about version of symbol module_layout
FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
EXT4-fs (mmcblk0p4): mounted filesystem with ordered data mode. Opts: (null)
random: crng init done
root@Doppler-258f5467:~#

Answering your other question, our rootfs is writable.