Rootfs grub.cfg uses hardcoded hd0 that is never corrected at runtime, causing boot failure when firmware enumerates other block devices first
Summary
In the grub.d/ integration, the rootfs copy of the generated grub.cfg sets GRUB’s $root using a hardcoded mender_grub_storage_device=hd0 that is never dynamically corrected at runtime. On any boot where firmware enumerates another block device (for example, a USB thumb drive, USB hard drive, or any peripheral that exposes a USB mass storage interface) before the real boot disk, hd0 resolves to the wrong device and boot fails with error: disk 'hd0,msdosN' not found. This affects any GRUB-based Mender deployment regardless of architecture (x86_64, ARM64, i386) or firmware type (UEFI or legacy BIOS) - GRUB’s hdN numbering is determined by firmware device enumeration, which is susceptible to USB device insertion on all platforms.
The ESP copy of grub.cfg is not affected because its 00_05_mender_setup_env_grub block contains a regexp line that extracts the real disk from $root at runtime. The rootfs copy does not get this line emitted because 00_05 is gated behind GRUB_MENDER_GRUBENV_CFG_GENERATION != "true".
Symptom
Loading Linux <version> ...
error: disk 'hd0,msdos2' not found
Loading initial ramdisk ...
error: you need to load the kernel first
Failed to boot both default and fallback entries.
After falling through the GRUB menu and the 10-second timeout, the Debian 00_header’s UUID-based search eventually allows boot via the rescue path. During a Mender upgrade, this failure mode causes mender-update to roll back the upgrade even though the new rootfs is intact, because the boot never succeeds cleanly enough to commit.
At the GRUB prompt during a failing boot, ls -l confirms that hd0 is the unexpected device (often very small, no recognizable filesystem) and the real SSD appears as hd1.
Root cause
The grub.d/README-mender.md states that two-index scripts (00_xx) should emit content only into the ESP copy of grub.cfg, and one-index scripts should emit into the rootfs copy. This is supposed to be enforced by gating two-index scripts with if [ "$GRUB_MENDER_GRUBENV_CFG_GENERATION" != "true" ]; then.
Only 00_05_mender_setup_env_grub and 00_90_mender_boot_selected_rootfs actually implement this gate. 00_00_mender_grubenv_defines, 00_04_mender_setup_env_functions_grub, and 00_80_mender_choose_partitions_grub do not - they emit content unconditionally. Because update-grub runs twice during image generation (once normally to produce the ESP copy, then recursively from 90_mender_generate_dual_rootfs_grub with GRUB_MENDER_GRUBENV_CFG_GENERATION=true set to produce the rootfs copy), ungated scripts emit the same content into both output files. This is how 00_00’s mender_grub_storage_device=hd0 default ends up in the rootfs copy despite the README-mender.md design stating that two-index scripts should only appear in the ESP copy.
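The effect of a missing gate across the two update-grub passes can be sketched in plain shell. This is only an illustration, not the actual Mender scripts: the function names emit_gated and emit_ungated and the placeholder output "gated-block" are hypothetical. Only the GRUB_MENDER_GRUBENV_CFG_GENERATION test itself is taken from the scripts discussed above.

```shell
#!/bin/sh
# Hypothetical stand-in for a gated two-index script such as 00_05.
emit_gated() {
    if [ "$GRUB_MENDER_GRUBENV_CFG_GENERATION" != "true" ]; then
        echo "gated-block"
    fi
}

# Hypothetical stand-in for an ungated two-index script such as 00_00.
emit_ungated() {
    echo "mender_grub_storage_device=hd0"
}

# First pass: variable unset, output ends up in the ESP copy.
esp_copy=$(unset GRUB_MENDER_GRUBENV_CFG_GENERATION; emit_ungated; emit_gated)

# Second (recursive) pass: variable set to true, output ends up in the rootfs copy.
rootfs_copy=$(GRUB_MENDER_GRUBENV_CFG_GENERATION=true; emit_ungated; emit_gated)

printf 'ESP copy:\n%s\n\nrootfs copy:\n%s\n' "$esp_copy" "$rootfs_copy"
```

In this sketch the rootfs copy receives the hardcoded default but not the gated block, which mirrors the asymmetry described above.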
Execution flow in the rootfs copy of grub.cfg:
- Control reaches the rootfs copy via configfile /boot/grub-mender-grubenv.cfg, with $root correctly set to the active rootfs partition (e.g. hd1,gpt3) by the ESP stage.
- The (ungated) 00_00 block sets mender_grub_storage_device=hd0, clobbering the correct value the ESP stage’s 00_05 regexp had established.
- 00_05 is gated out in the rootfs stage, so there is no regexp to re-correct mender_grub_storage_device.
- The (ungated) 00_80 block and the intentional 07_mender_choose_partitions_grub (which is a byte-identical duplicate of 00_80) execute test -e (hd0,gpt${mender_ptable_part})/ against the wrong disk, fall through to the msdos branch, and set $root="hd0,msdos${mender_ptable_part}".
- The menuentry inherits the broken $root and linux /boot/vmlinuz-... fails.
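The flow above can be traced with a small shell simulation. It is a simplified approximation, not the real GRUB logic: part_exists is a hypothetical stand-in for GRUB's test -e on a partition, the partition layout of the simulated disk is invented, and mender_ptable_part=2 is an assumed value chosen to match the error message shown under Symptom.

```shell
#!/bin/sh
# Simulated firmware enumeration: a USB device took hd0, the real disk is hd1.
# part_exists is a hypothetical stand-in for GRUB's "test -e (disk,part)/".
part_exists() {
    case "$1" in
        hd1,gpt2|hd1,gpt3) return 0 ;;   # only the real disk has these partitions
        *) return 1 ;;
    esac
}

mender_ptable_part=2   # assumed partition number, for illustration only

# Step 1: the ESP stage hands over a correct $root.
root="hd1,gpt3"

# Step 2: the ungated 00_00 block resets the device to the hardcoded default.
mender_grub_storage_device=hd0

# Step 3: 00_05 is gated out, so nothing re-derives the device from $root.

# Step 4: 00_80 / 07 probe the wrong disk and fall through to the msdos branch.
if part_exists "${mender_grub_storage_device},gpt${mender_ptable_part}"; then
    root="${mender_grub_storage_device},gpt${mender_ptable_part}"
else
    root="${mender_grub_storage_device},msdos${mender_ptable_part}"
fi

# Step 5: the menuentry would now try to load the kernel from this $root.
echo "root=$root"
```

Under these assumptions the script ends with root=hd0,msdos2, the same device string that appears in the GRUB error.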
Minimal, one-line fix
Add the regexp line to grub.d/00_00_mender_grubenv_defines immediately after the hardcoded mender_grub_storage_device default:
cat <<'END_OF_MENDER_GRUBENV_CFG_FILE'
mender_rootfsa_part=@MENDER_ROOTFS_PART_A_NUMBER@
mender_rootfsb_part=@MENDER_ROOTFS_PART_B_NUMBER@
mender_grub_storage_device=@MENDER_GRUB_STORAGE_DEVICE@
regexp (.*),(.*) $root -s mender_grub_storage_device
kernel_imagetype=@MENDER_KERNEL_IMAGETYPE@
initrd_imagetype=@MENDER_INITRD_IMAGETYPE@
mender_kernel_root_base=@MENDER_KERNEL_ROOT_BASE@
END_OF_MENDER_GRUBENV_CFG_FILE
Because 00_00 is emitted into both the ESP and rootfs copies, the regexp correction now runs in both contexts. The hardcoded default is retained as a safe fallback if $root is somehow unset. I validated this fix manually on an affected unit over 30+ reboots with zero failures; baseline failure rate was high and consistent.
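To make the intended behavior of the default-plus-regexp pair concrete, here is a rough shell approximation. The function derive_storage_device is hypothetical and only mimics the GRUB lines with POSIX parameter expansion. It assumes, as stated above, that a non-matching regexp leaves the previously assigned default in place.

```shell
#!/bin/sh
# derive_storage_device ROOT DEFAULT   (hypothetical helper)
# Approximates the two GRUB lines:
#   mender_grub_storage_device=DEFAULT
#   regexp (.*),(.*) $root -s mender_grub_storage_device
derive_storage_device() {
    root_value=$1
    device=$2                               # hardcoded default, kept if no match
    case "$root_value" in
        *,*) device=${root_value%,*} ;;     # text before the last comma, like a greedy (.*)
    esac
    echo "$device"
}

derive_storage_device "hd1,gpt3" "hd0"   # real disk enumerated second
derive_storage_device "hd0,gpt3" "hd0"   # no extra device present
derive_storage_device "" "hd0"           # $root empty: default survives
```

The first call yields hd1 and the other two yield hd0, so the derived device follows $root whenever $root carries a partition suffix.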
More invasive fix
The minimal fix relies on 00_00 being ungated, so the new regexp line is emitted into both the ESP and rootfs copies of grub.cfg via the two separate update-grub invocations. A cleaner structural fix would:
- Add GRUB_MENDER_GRUBENV_CFG_GENERATION gates to 00_00, 00_04, and 00_80 so they only emit into the ESP copy as README-mender.md describes.
- Move the regexp out of the gated 00_05 section (or duplicate it to a one-index script) so it runs in both stages.
- Deduplicate 00_80_mender_choose_partitions_grub and 07_mender_choose_partitions_grub, which are currently byte-identical.
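Before making the structural changes listed above, a quick audit of a checkout may help confirm which scripts lack the gate and whether the two partition-selection scripts are still identical. The helpers audit_grub_d and same_content below are hypothetical, and the ./grub.d path in the usage comment is an assumption about the local checkout. Note that the audit only checks whether the variable name is mentioned at all, which does not prove the gate is correct.

```shell
#!/bin/sh
# audit_grub_d DIR   (hypothetical helper)
# Lists two-index scripts (00_NN_*) in DIR that never mention the gate variable.
audit_grub_d() {
    dir=$1
    for script in "$dir"/00_[0-9][0-9]_*; do
        [ -f "$script" ] || continue
        if ! grep -q 'GRUB_MENDER_GRUBENV_CFG_GENERATION' "$script"; then
            echo "ungated: ${script##*/}"
        fi
    done
}

# same_content A B   (hypothetical helper)
# Succeeds when the two files are byte-identical.
same_content() {
    cmp -s "$1" "$2"
}

# Example usage against a checkout (path is an assumption):
# audit_grub_d ./grub.d
# same_content ./grub.d/00_80_mender_choose_partitions_grub \
#              ./grub.d/07_mender_choose_partitions_grub && echo "duplicate"
```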
Additional notes
- The regexp correction in 00_05 seems to demonstrate that the upstream authors were aware of the disk-renumbering risk and built the right fix - but they appear to have applied it only to the ESP stage. The rootfs stage was left unprotected, presumably because the two-index gating on 00_05 was intended to suppress it in the rootfs copy, and possibly it was not noticed that no equivalent correction was provided for that stage.
- The mender_rootfsa_uuid / mender_rootfsb_uuid branch in 00_80 / 07 is never exercised by default, and even when set would only fix the kernel’s root= parameter, not GRUB’s internal $root. A complete fix for disk renumbering needs to correct both independently.
- 00_05 and 00_90 both contain comments explaining the gating intent. 00_00, 00_04, and 00_80 do not, suggesting the gating omission may have been unintentional rather than by design.