Fixing kernel updates not applying in Fedora 43
One of the stranger issues you can encounter on Fedora is when new kernels stop being bootable from the GRUB bootloader. This usually happens after major version updates, and has been reported online from Fedora 39 up to the present day. In my case kernels were being installed to /boot/efi rather than /boot, where GRUB expects kernels to be placed. In the process of fixing this issue, I learned about how Fedora and kernel-install determine where to place boot entries.
I first hit this issue around the time I updated to Fedora 43. Afterwards I noticed I was stuck booting from kernel 6.16.12-200 despite having a kernel 6.17.x package installed. After I installed 6.18.0-rc5 from the vanilla COPRs (in an attempt to report a kernel sleep bug), this new kernel was not visible in the GRUB boot process, which automatically booted into 6.16.12 without asking what to select.
While looking online, I found that the command sudo grubby --info=ALL shows which boot entries GRUB can find. Its output listed kernel="/boot/vmlinuz-6.16.12-200.fc42.x86_64". To find out where grubby got its list of kernels from, I ran it under strace to see which files and folders it was opening. This showed that it was reading kernel configurations from /boot/loader/entries/:
root@ivy-fedora:/home/nyanpasu64/...# strace -fs999 grubby --info=ALL 2>&1 | rg open
...
[pid 11036] openat(AT_FDCWD, "/boot/loader/entries/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY <unfinished ...>
[pid 11036] <... openat resumed>) = 3
Strangely, on Fedora /boot is an ext4 partition that contains the /boot/efi mount point for the FAT32 ESP (EFI System Partition). This makes it possible for kernel installation and startup to mix up these two folders. My Arch Linux install mounts the FAT32 ESP at /boot and has no /boot/efi, so there is no way to install the kernel in one place and boot from another.
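To see how your own system is laid out, a minimal sketch like the following prints the filesystem type mounted at each of the two paths. The helper name fstype_of is made up for illustration, and its optional second argument (a file in /proc/mounts format) exists only so the function can be tried on sample data.

```shell
# Print the filesystem type mounted at a given mount point.
# fstype_of is a hypothetical helper, not part of any Fedora tool.
fstype_of() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    # Fields in /proc/mounts: device, mount point, fstype, options, ...
    # Keep the last match, since a later mount can shadow an earlier one.
    awk -v mp="$mountpoint" \
        '$2 == mp { fs = $3 } END { if (fs != "") print fs }' \
        "$mounts_file"
}

# An empty result means nothing is mounted at that exact path.
for mp in /boot /boot/efi; do
    printf '%s: %s\n' "$mp" "$(fstype_of "$mp")"
done
```

On a layout like the one described above I would expect ext4 for /boot and vfat for /boot/efi, since Linux reports FAT32 as vfat.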
Other possible issues
The most confusing part of my problem was that the combination of Fedora and GRUB is fragile, and similar symptoms can come from different underlying causes. One person even reported a failure from placing /boot/ in a RAID configuration!
In my research, I came across a Unix Stack Exchange question about missing /etc/machine-id. Apparently this file is required for kernels to install properly. But I had this file; if your system is missing it, you may need to create it to solve the problem.
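A quick way to rule this cause out is to check that the file exists and looks like a machine ID. This is only a sketch: check_machine_id is a name I made up, and the path argument is there so the function can be pointed at a test file. The format check relies on the machine-id file being a single line of 32 lowercase hexadecimal characters.

```shell
# Report whether a machine-id file exists and is well formed.
# check_machine_id is a hypothetical helper; the argument defaults to
# the real /etc/machine-id.
check_machine_id() {
    id_file=${1:-/etc/machine-id}
    # -s is true only if the file exists and is not empty.
    if [ ! -s "$id_file" ]; then
        echo "missing or empty: $id_file"
        return 1
    fi
    # -x requires the whole line to match: exactly 32 lowercase hex digits.
    if grep -Eqx '[0-9a-f]{32}' "$id_file"; then
        echo "ok: $(cat "$id_file")"
    else
        echo "unexpected contents in $id_file"
        return 1
    fi
}
```

Calling check_machine_id with no arguments checks the real file. If it reports a missing file, systemd ships a systemd-machine-id-setup command for generating one; read its man page first, since this post did not need to go down that path.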
A different issue occurs when kernels are installed to /boot/loader/entries but are not visible to GRUB until you regenerate grub.cfg with grub2-mkconfig -o /boot/grub2/grub.cfg. I've read that this happens when blscfg is disabled in grub.cfg, causing the list of kernels to be stored in the GRUB config rather than loaded from the filesystem.
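To check for this variant, one option is to look at the GRUB defaults file. The sketch below assumes that Fedora's GRUB_ENABLE_BLSCFG setting in /etc/default/grub is what decides whether the generated grub.cfg uses blscfg. That is my understanding of Fedora's setup, not something verified in this post, and the function name blscfg_enabled is invented for the example.

```shell
# Succeed if the GRUB defaults file enables BLS-based boot entries.
# blscfg_enabled is a hypothetical helper. Assumption: Fedora reads
# GRUB_ENABLE_BLSCFG from /etc/default/grub when generating grub.cfg.
blscfg_enabled() {
    defaults_file=${1:-/etc/default/grub}
    # Accept true with or without quotes; commented-out lines do not match.
    grep -Eq '^[[:space:]]*GRUB_ENABLE_BLSCFG=["'\'']?true["'\'']?[[:space:]]*$' \
        "$defaults_file" 2>/dev/null
}

if blscfg_enabled; then
    echo "blscfg looks enabled"
else
    echo "blscfg looks disabled, or /etc/default/grub could not be read"
fi
```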
In my case, the missing kernels were not found in /boot/loader/entries, but instead in /boot/efi/loader/entries. Annoyingly, /boot/efi/ was only readable as root, from a session of sudo -s or sudo su. Looking up this symptom, I found a Fedora forum post stating that you have to rename /boot/efi/$(cat /etc/machine-id), since:
When /boot/efi/$machineid exists... the systemd in F39 now considers this machine to be using sd-boot bootloader instead of the usual GRUB.
I tried renaming /boot/efi/(machine-id)/ then running sudo dnf reinstall kernel-core, but it recreated the folder and failed because /boot/efi/ was full:
>>> Running %posttrans scriptlet: kernel-core-0:6.18.0-0.rc5.251115.7a0892d2.348.vanilla.fc43.x86_64
>>> Non-critical error in %posttrans scriptlet: kernel-core-0:6.18.0-0.rc5.251115.7a0892d2.348.vanilla.fc43.x86_64
>>> Scriptlet output:
>>> cp: error writing '/boot/efi/a5e7cbe1b56242238c7c906bfb000359/0-rescue/initrd': No space left on device
>>> dracut[F]: Creation of /boot/efi/a5e7cbe1b56242238c7c906bfb000359/0-rescue/initrd failed
>>> /usr/lib/kernel/install.d/51-dracut-rescue.install failed with exit status 1.
>>>
>>> [RPM] %posttrans(kernel-core-6.18.0-0.rc5.251115.7a0892d2.348.vanilla.fc43.x86_64) scriptlet failed, exit status 1
Transaction failed: Rpm transaction failed.
Why is kernel-core recreating this folder after I removed it?
Tracing the issue
To find out why reinstalling kernel-core was misbehaving, I downloaded a .rpm for kernel-core from the package build page, then ran rpm -qp --scripts kernel-core-6.17.7-300.fc43.aarch64.rpm to view the installation scripts. In hindsight I could've run rpm -q --scripts kernel-core, which shows the scripts of the installed kernel package. I've reorganized the output of the program into execution order:
nyanpasu64@ivy-fedora ~/Downloads> rpm -qp --scripts kernel-core-6.17.7-300.fc43.aarch64.rpm
# postinstall scriptlet (using /bin/sh):
mkdir -p /var/lib/rpm-state/kernel
touch /var/lib/rpm-state/kernel/installing_core_6.17.7-300.fc43.aarch64
# posttrans scriptlet (using /bin/sh):
rm -f /var/lib/rpm-state/kernel/installing_core_6.17.7-300.fc43.aarch64
/bin/kernel-install add 6.17.7-300.fc43.aarch64 /lib/modules/6.17.7-300.fc43.aarch64/vmlinuz || exit $?
if [[ ! -e "/boot/symvers-6.17.7-300.fc43.aarch64.xz" ]]; then
cp "/lib/modules/6.17.7-300.fc43.aarch64/symvers.xz" "/boot/symvers-6.17.7-300.fc43.aarch64.xz"
if command -v restorecon &>/dev/null; then
restorecon "/boot/symvers-6.17.7-300.fc43.aarch64.xz"
fi
fi
We see that Fedora uses kernel-install to install kernels. Searching for the program name, I found documentation on the Arch Wiki saying it came from systemd. The page wasn't very helpful, but I wanted to see if Fedora's kernel-install also came from systemd. By running dnf provides /usr/bin/kernel-install or rpm --query --file /usr/bin/kernel-install, I found that kernel-install came from the systemd-udev-258.2-1.fc43.x86_64 package.
Running sudo kernel-install --verbose (which shows the current configuration) printed a long stream of text (permalink). Note that it says /boot/efi/loader/entries.srel with 'type1' found, using layout=bls, and thinks we've booted from a kernel path in /boot/efi/(machine-id)/ that does not exist (GRUB actually reads from /boot/)! I checked /boot/efi/loader/entries.srel, which contained type1, which I did not understand at the time.
So it seemed that GRUB was booting from /boot/loader/entries (and grub.cfg explicitly said "The blscfg command parses the BootLoaderSpec files stored in /boot/loader/entries"), but kernel-install thought GRUB was booting from /boot/efi/(machine-id)/loader/entries and would install all new kernels there. Why would that be the case?
Next I tried deleting /boot/efi/(machine-id)/, before running sudo kernel-install add-all --verbose, to see why it created the folder even though GRUB didn't use it. This printed:
/boot/efi/loader/entries.srel with 'type1' found, using layout=bls.
Using ENTRY_DIR=/boot/efi/a5e7cbe1b56242238c7c906bfb000359/6.16.12-200.fc42.x86_64
mkdir -p /boot/efi/a5e7cbe1b56242238c7c906bfb000359/6.16.12-200.fc42.x86_64
Fixing the issue
Another search for "does grub load entries from /boot/loader or /boot/efi/loader" turned up a new forum post I had not seen before. This discussion started out like the ones before, but eventually turned up the answer:
In this context, KERNEL_INSTALL_LAYOUT=bls means systemd-boot instead of grub2. That also implies that the kernel and initrd will be stored in the ESP instead of the boot file system.
Scanning over my logs, it was entries.srel that told kernel-install to use BLS! Was it safe to delete it? One quick search for "fedora /boot/efi/loader/entries.srel" later, I found a thread of someone with the issue I experienced. Here someone said:
The file /boot/efi/loader/entries.srel could also force the bls mode. If it exists, remove it.
I removed the file, deleted /boot/efi/(machine-id)/ (if you don't delete both, kernel-install will still think you're on BLS), and ran a final kernel-install add-all -v to write to /boot/. Note that the command output will still say you're installing to the machine-id path, but I ran sudo ls /boot/efi/ afterwards and confirmed that the machine-id folder was not created.
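Before deleting anything, it can help to confirm which of the two markers are actually present. The following is a rough sketch of that check, based only on the two triggers described in the forum posts quoted above. The function name find_bls_markers and its arguments are my own; the arguments default to the real ESP path and machine ID, and on a real system it needs to run as root because /boot/efi is not readable otherwise.

```shell
# List the things on the ESP that push kernel-install towards layout=bls.
# find_bls_markers is a hypothetical helper; both arguments are optional.
find_bls_markers() {
    esp=${1:-/boot/efi}
    machine_id=${2:-$(cat /etc/machine-id 2>/dev/null)}

    # Marker 1: loader/entries.srel (mine contained "type1").
    if [ -f "$esp/loader/entries.srel" ]; then
        printf '%s (contains: %s)\n' "$esp/loader/entries.srel" \
            "$(cat "$esp/loader/entries.srel")"
    fi
    # Marker 2: a directory named after the machine ID.
    # Skip the check if the ID is unknown, or "$esp/" itself would match.
    if [ -n "$machine_id" ] && [ -d "$esp/$machine_id" ]; then
        printf '%s/\n' "$esp/$machine_id"
    fi
}
```

Running find_bls_markers with no arguments from a root shell prints one line per marker found, and nothing once both have been removed.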
Finally I ran sudo grubby --info=ALL which showed my new kernels were properly detected. I rebooted into the new kernel, ran sudo dnf remove kernel-core-6.16.12-200.fc42.x86_64, and was finally free of the scourge of the bad kernel install.
- In hindsight I should've statted the bad entries.srel file to find out when it was created, and whether it could've come from one of my system updates. But at least now I know what causes the issue, and can reconstruct the sequence of events if it ever happens again.