BUG REPORT: mender_grub_storage_device not corrected in rootfs grub.cfg, causing boot failure on disk renumbering

Rootfs grub.cfg uses hardcoded hd0 that is never corrected at runtime, causing boot failure when firmware enumerates other block devices first

Summary

In the grub.d/ integration, the rootfs copy of the generated grub.cfg sets GRUB’s $root using a hardcoded mender_grub_storage_device=hd0 that is never corrected at runtime. On any boot where firmware enumerates another block device (for example, a USB thumb drive, USB hard drive, or any peripheral that exposes a USB mass storage interface) before the real boot disk, hd0 resolves to the wrong device and boot fails with error: disk 'hd0,msdosN' not found. This affects any GRUB-based Mender deployment regardless of architecture (x86_64, ARM64, i386) or firmware type (UEFI or legacy BIOS), because GRUB’s hdN numbering follows firmware device enumeration, which can change when a USB device is inserted on any platform.

The ESP copy of grub.cfg is not affected because its 00_05_mender_setup_env_grub block contains a regexp line that extracts the real disk from $root at runtime. The rootfs copy does not get this line emitted because 00_05 is gated behind GRUB_MENDER_GRUBENV_CFG_GENERATION != "true".

Symptom

Loading Linux <version> ...
error: disk 'hd0,msdos2' not found
Loading initial ramdisk ...
error: you need to load the kernel first

Failed to boot both default and fallback entries.

After falling through the GRUB menu and the 10-second timeout, the Debian 00_header’s UUID-based search eventually allows boot via the rescue path. During a Mender upgrade, this failure mode causes mender-update to roll back the upgrade even though the new rootfs is intact, because the boot never succeeds cleanly enough to commit.

At the GRUB prompt during a failing boot, ls -l confirms that hd0 is the unexpected device (often very small, no recognizable filesystem) and the real SSD appears as hd1.

Root cause

The grub.d/README-mender.md states that two-index scripts (00_xx) should emit content only into the ESP copy of grub.cfg, and one-index scripts should emit into the rootfs copy. This is supposed to be enforced by gating two-index scripts with if [ "$GRUB_MENDER_GRUBENV_CFG_GENERATION" != "true" ]; then.

Only 00_05_mender_setup_env_grub and 00_90_mender_boot_selected_rootfs actually implement this gate. 00_00_mender_grubenv_defines, 00_04_mender_setup_env_functions_grub, and 00_80_mender_choose_partitions_grub do not; they emit content unconditionally. Because update-grub runs twice during image generation (once normally to produce the ESP copy, then recursively from 90_mender_generate_dual_rootfs_grub with GRUB_MENDER_GRUBENV_CFG_GENERATION=true set to produce the rootfs copy), ungated scripts emit the same content into both output files. This is how the mender_grub_storage_device=hd0 default from 00_00 ends up in the rootfs copy, even though README-mender.md says two-index scripts should only appear in the ESP copy.
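
The two-pass behavior can be illustrated with a minimal shell sketch. The function names are hypothetical stand-ins for the real grub.d scripts and the emitted lines are abbreviated; only the gate expression, the variable name, and the two emitted lines are taken from this report.

```shell
#!/bin/sh
# Hypothetical stand-ins for two grub.d scripts; not the real script bodies.

# Models an ungated two-index script such as 00_00: emits in every pass.
emit_ungated() {
    cat <<'EOF'
mender_grub_storage_device=hd0
EOF
}

# Models a gated two-index script such as 00_05: skipped in the rootfs pass.
emit_gated() {
    if [ "$GRUB_MENDER_GRUBENV_CFG_GENERATION" != "true" ]; then
        cat <<'EOF'
regexp (.*),(.*) $root -s mender_grub_storage_device
EOF
    fi
}

# Pass 1: normal update-grub run, producing the ESP copy.
esp_cfg=$(GRUB_MENDER_GRUBENV_CFG_GENERATION=""; emit_ungated; emit_gated)

# Pass 2: recursive run with the variable set, producing the rootfs copy.
rootfs_cfg=$(GRUB_MENDER_GRUBENV_CFG_GENERATION=true; emit_ungated; emit_gated)

printf 'ESP copy:\n%s\n\n' "$esp_cfg"
printf 'rootfs copy:\n%s\n' "$rootfs_cfg"
```

In this model the rootfs copy receives the hd0 default but not the regexp line, which is the mismatch described above.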

Execution flow in the rootfs copy of grub.cfg:

  1. Control reaches the rootfs copy via configfile /boot/grub-mender-grubenv.cfg, with $root correctly set to the active rootfs partition (e.g. hd1,gpt3) by the ESP stage.

  2. The (ungated) 00_00 block sets mender_grub_storage_device=hd0, clobbering the correct value the ESP stage’s 00_05 regexp had established.

  3. 00_05 is gated out in the rootfs stage, so there is no regexp to re-correct mender_grub_storage_device.

  4. The (ungated) 00_80 block and the one-index 07_mender_choose_partitions_grub block (a byte-identical duplicate of 00_80 that is intentionally emitted into the rootfs copy) execute test -e (hd0,gpt${mender_ptable_part})/ against the wrong disk, fall through to the msdos branch, and set $root="hd0,msdos${mender_ptable_part}".

  5. The menuentry inherits the broken $root and linux /boot/vmlinuz-... fails.
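
The flow above can be modeled in plain shell. This is a simulation under stated assumptions, not GRUB code: part_exists stands in for GRUB's test -e, device_from_root approximates the regexp line with sed, and the partition layout (a GPT disk seen as hd1, rootfs on partition 2) is a hypothetical example chosen to match the symptom.

```shell
#!/bin/sh
# Hypothetical firmware view: hd0 is a USB device with nothing usable,
# hd1 is the real SSD with a GPT partition table.
existing_parts="hd1,gpt1 hd1,gpt2 hd1,gpt3"

# Stand-in for GRUB's `test -e (dev,part)/`.
part_exists() {
    case " $existing_parts " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Models the gpt-then-msdos fallthrough: $1 = storage device, $2 = partition.
choose_root() {
    if part_exists "$1,gpt$2"; then
        echo "$1,gpt$2"
    else
        echo "$1,msdos$2"
    fi
}

# Approximates `regexp (.*),(.*) $root -s mender_grub_storage_device`.
device_from_root() {
    printf '%s\n' "$1" | sed -n 's/^\(.*\),\(.*\)$/\1/p'
}

root="hd1,gpt3"                  # step 1: set correctly by the ESP stage
mender_grub_storage_device=hd0   # step 2: clobbered by the ungated default

# Steps 3 and 4: no correction runs, so the wrong disk is probed.
broken_root=$(choose_root "$mender_grub_storage_device" 2)

# Same selection with the device re-derived from $root first.
fixed_root=$(choose_root "$(device_from_root "$root")" 2)

echo "without correction: $broken_root"
echo "with correction:    $fixed_root"
```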

Minimal, one-line fix

Add the regexp line to grub.d/00_00_mender_grubenv_defines immediately after the hardcoded mender_grub_storage_device default:

cat <<'END_OF_MENDER_GRUBENV_CFG_FILE'
mender_rootfsa_part=@MENDER_ROOTFS_PART_A_NUMBER@
mender_rootfsb_part=@MENDER_ROOTFS_PART_B_NUMBER@
mender_grub_storage_device=@MENDER_GRUB_STORAGE_DEVICE@
regexp (.*),(.*) $root -s mender_grub_storage_device
kernel_imagetype=@MENDER_KERNEL_IMAGETYPE@
initrd_imagetype=@MENDER_INITRD_IMAGETYPE@
mender_kernel_root_base=@MENDER_KERNEL_ROOT_BASE@

END_OF_MENDER_GRUBENV_CFG_FILE

Because 00_00 is emitted into both the ESP and rootfs copies, the regexp correction now runs in both contexts. The hardcoded default is retained as a safe fallback if $root is somehow unset. I validated this fix manually on an affected unit over 30+ reboots with zero failures; baseline failure rate was high and consistent.
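
One way to confirm that a generated config actually picked up the change is to check that the regexp line appears after the hardcoded default. The helper below is a rough sketch: has_root_correction is a hypothetical name, the match is purely textual, and the demo runs on temporary files rather than real update-grub output.

```shell
#!/bin/sh
# Returns 0 if the file contains the regexp correction somewhere after the
# hardcoded mender_grub_storage_device default, non-zero otherwise.
has_root_correction() {
    awk '
        /^[ \t]*mender_grub_storage_device=/ { seen_default = 1; next }
        seen_default && /^[ \t]*regexp .* -s mender_grub_storage_device/ { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Self-contained demo on temporary files (not real grub.cfg output).
tmpdir=$(mktemp -d)
printf '%s\n' 'mender_grub_storage_device=hd0' \
    'regexp (.*),(.*) $root -s mender_grub_storage_device' > "$tmpdir/fixed.cfg"
printf '%s\n' 'mender_grub_storage_device=hd0' > "$tmpdir/unfixed.cfg"

for cfg in "$tmpdir/fixed.cfg" "$tmpdir/unfixed.cfg"; do
    if has_root_correction "$cfg"; then
        echo "$(basename "$cfg"): correction present"
    else
        echo "$(basename "$cfg"): correction missing"
    fi
done

rm -rf "$tmpdir"
```

On a real system the function would be pointed at the rootfs copy, which this report refers to as /boot/grub-mender-grubenv.cfg, and at the ESP copy.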

More invasive fix

The minimal fix relies on 00_00 being ungated, so the new regexp line is emitted into both the ESP and rootfs copies of grub.cfg via the two separate update-grub invocations. A cleaner structural fix would:

  1. Add GRUB_MENDER_GRUBENV_CFG_GENERATION gates to 00_00, 00_04, and 00_80 so they only emit into the ESP copy as README-mender.md describes.

  2. Move the regexp out of the gated 00_05 section (or duplicate it to a one-index script) so it runs in both stages.

  3. Deduplicate 00_80_mender_choose_partitions_grub and 07_mender_choose_partitions_grub, which are currently byte-identical.
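
Items 1 and 3 lend themselves to a quick audit before any restructuring. The sketch below uses hypothetical helper names and demo file names. It only detects whether a script mentions the gate variable at all, which is weaker than proving the gate is applied correctly.

```shell
#!/bin/sh
# Prints the names of two-index scripts (NN_NN_*) in directory $1 that never
# mention the gate variable.
list_ungated_two_index() {
    for script in "$1"/[0-9][0-9]_[0-9][0-9]_*; do
        [ -f "$script" ] || continue
        grep -q 'GRUB_MENDER_GRUBENV_CFG_GENERATION' "$script" \
            || basename "$script"
    done
}

# Succeeds if the two files are byte-identical.
is_duplicate() {
    cmp -s "$1" "$2"
}

# Demo on a throwaway directory with hypothetical script names.
demo=$(mktemp -d)
printf 'echo defines\n' > "$demo/00_00_example_defines"
printf 'if [ "$GRUB_MENDER_GRUBENV_CFG_GENERATION" != "true" ]; then echo env; fi\n' \
    > "$demo/00_05_example_env"
printf 'echo choose\n' > "$demo/00_80_example_choose"
cp "$demo/00_80_example_choose" "$demo/07_example_choose"

ungated=$(list_ungated_two_index "$demo")
echo "two-index scripts without a gate:"
echo "$ungated"

if is_duplicate "$demo/00_80_example_choose" "$demo/07_example_choose"; then
    echo "00_80_example_choose and 07_example_choose are byte-identical"
fi

rm -rf "$demo"
```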

Additional notes

  • The regexp correction in 00_05 suggests that the upstream authors were aware of the disk-renumbering risk and built the right fix, but applied it only to the ESP stage. The gate on 00_05 suppresses the correction in the rootfs copy, and no equivalent correction was provided for that stage, possibly because the gap went unnoticed.

  • The mender_rootfsa_uuid / mender_rootfsb_uuid branch in 00_80/07 is never exercised by default, and even when set would only fix the kernel’s root= parameter, not GRUB’s internal $root. A complete fix for disk renumbering needs to correct both independently.

  • 00_05 and 00_90 both contain comments explaining the gating intent. 00_00, 00_04, and 00_80 do not, suggesting the gating omission may have been unintentional rather than by design.