Sorry for the overly long message, but I felt I had to fully explain and prove the source of the problem and its impact.
Short version: The “install” statement in meta-mender-core/classes/mender-bootimg.bbclass partially corrupts the bootimg sparse file. Parts of the file that should be mapped are now unmapped, making it unsuitable for use with bmaptool. Quick fix: replace “install” with “mv”.
Long version, with analysis of root cause and its impact:
I will explain what changed in Yocto and why that has an impact on meta-mender.
The dunfell release of meta-poky included coreutils 8.31. The kirkstone release of meta-poky then moved to coreutils 9.0. In this version of coreutils, both install and cp detect holes in sparse files differently, so using “install” on a sparse file produces a different sparse file, with more holes in it. The difference in behavior is easy to reproduce with a test like the one below, which is modeled on how meta-mender creates the bootimg.
# Create 50MB sparse file
dd if=/dev/zero of=test.img count=0 bs=1M seek=50
du -h test.img # Output: 0 test.img
# Create FAT filesystem on image
mkfs.vfat -n TEST test.img
du -h test.img # Output: 120K test.img
# I am using install (GNU coreutils) 8.30 for this test
install -m 0644 test.img test_installed.img
du -h test_installed.img # Output: 120K test_installed.img
As you can see, the installed file has the same amount of mapped data as the source file. Now, get a copy of coreutils 9.0 (the easiest way is to download a release archive of coreutils 9.0 and run configure, make and make install with a non-standard prefix, so that your OS installation is not modified) and use install 9.0 to install the same file:
# I am using install (GNU coreutils) 9.0 for this test
~/storage/coreutils-9.0/installdir/bin/install -m 0644 test.img test_installed_90.img
du -h test_installed_90.img # Output: 12K test_installed_90.img
The newly installed file now has only 12 kB mapped! If you then create a bmap file from each of these files, both test.img and test_installed.img produce a bmap with 30 consecutive 4k blocks mapped. However, test_installed_90.img produces a bmap with only blocks 0, 13 and 25 mapped. This matters when the image is written to a device that initially contains random data: bmaptool writes only blocks 0, 13 and 25 and leaves whatever was already on the device in the other blocks, which the filesystem expects to contain zeros. You can simulate it this way:
# Prepare a fake device with random data in it
dd if=/dev/urandom of=test_device count=50 bs=1M
# Create loop device to the fake device, for me this gives /dev/loop3
sudo losetup -f --show test_device
# Write the image to the fake device
sudo bmaptool copy --bmap test_installed_90.img.bmap test_installed_90.img /dev/loop3
# Try to mount and view the contents of your device
mkdir -p mntdir
sudo mount /dev/loop3 mntdir
# This prints a large number of nonsensical directories and files, plus some input/output errors
ls mntdir
# Cleanup
sudo umount /dev/loop3
sudo losetup -d /dev/loop3
If you do the very same test with the original image, or with the copy that was installed using the “old” coreutils, the result is a perfectly fine and working FAT filesystem.
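To make the comparison between the three images easier to repeat, here is a minimal sketch of a helper script. It assumes GNU stat, and it only runs the bmap step if bmaptool is on your PATH. The function names sparse_report and make_bmap are my own, not part of any tool.

```shell
#!/bin/sh
# Compare logical size, allocated size and bmap ranges of the test images.
# Assumes GNU stat; sparse_report and make_bmap are illustrative names.

sparse_report() {
    apparent=$(stat -c '%s' "$1")   # logical file length in bytes
    blocks=$(stat -c '%b' "$1")     # number of allocated blocks
    blksize=$(stat -c '%B' "$1")    # bytes per block as counted by %b
    echo "$1: apparent=${apparent} allocated=$((blocks * blksize))"
}

make_bmap() {
    # Writes <image>.bmap next to the image
    bmaptool create -o "$1.bmap" "$1"
}

for img in test.img test_installed.img test_installed_90.img; do
    if [ -f "$img" ]; then
        sparse_report "$img"
        if command -v bmaptool >/dev/null 2>&1; then
            make_bmap "$img"
            # Each <Range> element is one mapped block or block range
            grep '<Range' "$img.bmap" || true
        fi
    fi
done
```

Running cmp on test.img and test_installed_90.img should report no difference, since holes read back as zeros. Only the allocation map differs, and that is exactly what the bmap is built from.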
Why is this important for Mender? Because Mender builds the bootimg in ${WORKDIR}, then uses “install” to copy the built bootimg to ${IMGDEPLOYDIR}. This worked in dunfell but no longer works in kirkstone. I searched across meta-mender and meta-poky to see how other types of images are created (including uefiimg in meta-mender and ext4 in meta-poky), and every example I could find creates the image file directly in ${IMGDEPLOYDIR}, so the image doesn’t need to be moved or copied at all. I also tested that simply using “mv” instead of “install” fixes the problem. So, here are two possible fixes:
- Use “mv” instead of “install”
- Build the bootimg directly in ${IMGDEPLOYDIR}, removing the call to “install”
I am already using a patch that replaces “install” with “mv”. Let me know if you would like a PR.
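As a sanity check on the “mv” option, here is a minimal sketch showing that a move within one filesystem is a plain rename: the inode and the allocated block count stay the same, so the sparse layout cannot change. The directories are hypothetical stand-ins for ${WORKDIR} and ${IMGDEPLOYDIR}, and boot.img is a made-up file name. I am assuming both directories are on the same filesystem, which I believe is the default layout since ${IMGDEPLOYDIR} normally sits under ${WORKDIR}.

```shell
#!/bin/sh
# Hypothetical stand-ins for ${WORKDIR} and ${IMGDEPLOYDIR}. Both are on the
# same filesystem here, so mv renames the file and copies no data.
top=$(mktemp -d)
workdir="$top/work"
deploydir="$top/work/deploy"
mkdir -p "$deploydir"

# 50 MiB sparse image with one 4 KiB block of zeros explicitly written
truncate -s 50M "$workdir/boot.img"
dd if=/dev/zero of="$workdir/boot.img" bs=4096 seek=13 count=1 \
   conv=notrunc,fsync status=none

before=$(stat -c '%i %b' "$workdir/boot.img")   # inode, allocated blocks
mv "$workdir/boot.img" "$deploydir/boot.img"
after=$(stat -c '%i %b' "$deploydir/boot.img")

echo "before: $before"
echo "after:  $after"
```

If the two directories were on different filesystems, mv would fall back to copying the data, and the problem might come back. In that case building the image directly in ${IMGDEPLOYDIR} would be the safer of the two fixes.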
NOTE: mender-dataimg.bbclass has the same issue, but I noticed that ext4 is not as fragile as FAT and does not appear to be corrupted when subjected to the same type of test. Regardless, the problem exists and should be fixed there too.