A filesystem only a penguin could love

About a year ago, when loading games onto my Wii's USB hard drive (shhhh), I found the hard drive would mount on Linux but not on the Wii. Although I initially suspected a disk incompatibility, digging revealed a bizarre rabbit hole: broken Linux FAT32 resizing tools had produced a corrupt partition header that only Linux recognized, that fsck couldn't fix, and that PhotoRec outright hung on when reading.

The problem was actually created years ago, when I most likely used a Linux live USB to shrink the FAT32 game partition and create an NTFS partition for transferring files. I used the file partition occasionally, but did not try playing USB games on my physical Wii until January 2022, when I prepared a Super Mario Sunburn ISO with widescreen and 60fps patches, then copied it to the FAT32 game partition on Linux. Unfortunately my USB loader was unable to mount the game partition, and plugging the drive into a Windows computer similarly prompted me to format it (IIRC).

At this point I was fearing a hard drive failure, considering I was working with a decade-plus-old Seagate HDD previously subject to high temperatures from an unvented padded case. But I rebooted to Linux and was again able to copy files out, and fsck didn't report any filesystem errors (aside from the dirty flag being set in one of the two FAT copies). After safely backing up my game files, I was now faced with unraveling the mystery of the Schrödinger's FAT32 partition.

The cause

In 2014-04, libparted was changed to enable detecting filesystems on disks with non-512-byte sectors. This inadvertently introduced a bug when resizing FAT32 partitions, where it would write corrupted bytes (uninitialized memory or disk contents?) to the partition's initial 3-byte magic number, instead of the correct partition header (interestingly executable x86 assembly code, an 8-bit jump instruction followed by a nop!). This bug made it into parted 3.2, released on 2014-07.
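The 3-byte header described above can be inspected directly. Below is a minimal sketch, assuming a POSIX shell with od and tr; the function names are made up for illustration, and the accepted patterns (EB xx 90 for a short jump plus nop, or E9 xx xx for a near jump) are the two forms generally allowed for a FAT boot sector, not something taken from libparted's code.

```shell
# Hypothetical helper: print the first three bytes of a partition
# (or image file) as lowercase hex, e.g. "eb5890".
fat_jump_bytes() {
    # -An drops the offset column, -tx1 prints one hex byte at a time,
    # -N3 stops after three bytes; tr strips the padding and newline.
    od -An -tx1 -N3 "$1" | tr -d ' \n'
}

# Hypothetical helper: succeed if those bytes look like a FAT boot jump.
fat_jump_looks_valid() {
    case "$(fat_jump_bytes "$1")" in
        eb??90|e9????) return 0 ;;  # short jump + nop, or near jump
        *) return 1 ;;              # anything else is suspicious
    esac
}
```

Reading a real partition (for example fat_jump_looks_valid /dev/sdb1, where the device name is an assumption) requires root. The check is read-only.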

In 2015-12, Tom Yan reported this bug to the GParted bugtracker, and bisected it to the commit that introduced it. Months passed without an attempt at a fix, until in 2016-04 Curtis Gedak dug through the code, located the bug, and proposed a patch, which was accepted within days.

In a travesty of open-source development, (lib)parted did not make a new release 3.3 until 2019-10, resulting in the latest stable version corrupting disks for a whopping >5 years, including nearly 4 years after being reported and >3 years after it was fixed! GParted Live included a patched libparted which fixed this bug starting in 2016-04, but other Linux distros and live USB boot disks may not have incorporated this bugfix patch. To quote the issue thread:

I have been holding off marking it as fixed for the reason that the patch has not been included in an official parted release... For GParted users on distros that do not include the patch, this bug report is a continuing reality.

And so, on some unlucky day, I must've run GParted or KDE Partition Manager to repartition my Wii's hard drive, shrinking the FAT32 partition to make room for an NTFS backup partition. And because I never tried reading the Wii games outside of Linux, I never discovered this disk corruption until much later.

Finding a solution

As mentioned, the corrupted FAT32 partition was readable on Linux, yet unrecognized by Windows (which prompted to format the partition) and Wii. When I tried running data recovery tools, Linux's fsck didn't detect any errors at all despite the clearly corrupted header, and testdisk (forgot which mode) did not recognize the filesystem, instead printing "No file found, filesystem may be damaged" (my TestDisk forum thread).

When I searched for advice on handling "No file found, filesystem may be damaged", one forum thread suggested using PhotoRec to unformat the drive. Unfortunately PhotoRec hung in an infinite loop within photorec_find_blocksize, a sign of a program bug. PhotoRec's developer, cgrenier, suggested that this was because there weren't 10 files with data in the filesystem. I was pretty sure there were at least 10 files when PhotoRec hung in photorec_find_blocksize, but to verify, I took a flash drive, put 20 photos on it, wiped the FAT32 header, then set PhotoRec to unformat it. It hung once again, showing that PhotoRec was in fact stuck in an infinite loop and not searching for files. Sadly cgrenier never did acknowledge this bug in the thread, and it still hangs in the latest release (I did not test the latest PhotoRec dev build as of 2022-12).

Eventually I stumbled across an alternative fix, typing into Bash:

echo -ne '\xeb\x58\x90' | sudo dd conv=notrunc bs=1 count=3 of=/dev/sdb1  # replace with the name of your disk partition

This command overwrites the first 3 bytes of the partition's corrupted FAT32 header with the standard values (matching those produced by Windows and GNOME Disks), allowing Windows and the Wii to once again recognize the filesystem with all data intact.
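A slightly more cautious variant of the same fix is sketched below. It is not what I ran at the time: the function name, the backup step, and the read-back check are all assumptions added for illustration. It uses octal printf escapes (\353\130\220 is EB 58 90) so it also works in shells whose printf lacks \x support.

```shell
# Hypothetical helper: back up the boot sector, then rewrite its first 3 bytes.
# $1 = partition or image path (assumption: e.g. /dev/sdb1)
# $2 = file to save the original first sector into
fix_fat32_jump() {
    # Save the first 512-byte sector so the change can be undone later.
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null || return 1
    # Write EB 58 90; conv=notrunc leaves the rest of the target untouched.
    printf '\353\130\220' | dd of="$1" conv=notrunc bs=1 count=3 2>/dev/null || return 1
    # Read the bytes back to confirm the write landed.
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "eb5890" ]
}
```

Run as root against a real partition, for example fix_fat32_jump /dev/sdb1 bootsector.bak, and double-check the device name first. Writing the backup file back with dd and conv=notrunc should restore the original sector.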

A more invasive alternative approach (which cgrenier mentioned but I didn't figure out right away) was to use TestDisk (not PhotoRec), then after loading the partition table, pick Advanced, Boot, RebuildBS. This appears to generally rebuild the boot sector properly, but also introduces more unnecessary changes beyond fixing the first 3 bytes.

✨ nyanpasu64 time-travels into the future ✨

I recently discovered that if I boot Linux and run testdisk /dev/sdb1 on the partition rather than disk, then select None partition table type, I can then set partition Type to FAT32, and TestDisk can browse files just fine using Undelete mode. Though the Boot menu says the boot sector and backup are both bad, and offers to Rebuild BS which fixes the problem (though oddly writes EB 3C 90 rather than EB 58 90).
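One possible reading of the two byte sequences, based on the general FAT boot sector layout rather than anything TestDisk documents: EB is an x86 short jump whose one-byte displacement is counted from the end of the two-byte instruction, so each header skips to a different offset in the boot sector. The helper name jump_target is made up for illustration.

```shell
# Hypothetical helper: target offset of a short jump (EB xx) placed at offset 0.
# Target = 2 (length of the jump instruction) + displacement byte.
jump_target() {
    echo $(( 2 + 0x$1 ))
}

jump_target 58   # prints 90 (the EB 58 90 header)
jump_target 3C   # prints 62 (the EB 3C 90 header written by Rebuild BS)
```

Offset 90 is where boot code conventionally starts after the longer FAT32 parameter block, while 62 is the conventional start after the shorter FAT12/FAT16 one. That might be why the rebuilt header looks FAT16-style, but this is a guess and not something I have confirmed.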

Additionally, if you run testdisk /dev/sdb and pick the Intel partition type, you can still access the FAT32 partition. Don't pick Analyse (the default option) since it drops the partition when initiating a search and doesn't find it during search. Instead, pick Advanced and pick the partition, at which point you can Undelete (might not have worked at the time, I don't remember) or Boot -> Rebuild BS.

Interestingly, PhotoRec works normally on a regular or "Rebuild BS" corrupted FAT32 boot sector (recovering files but not the directory structure). But if you enable PhotoRec's expert mode, then accept its new offer to unformat the filesystem, it hangs on both types of boot sector (I think, I didn't check in gdb or wait for a long time).

Thoughts

Honorable mention: Fragmented NTFS disks corrupted by resizing

I once lost a Windows installation to the same issue (as far as I know) as Marcan's "Rescuing a broken NTFS filesystem". First I shrank a Windows 10 partition on a 256GB SSD to accommodate a Linux dual boot, but the <150 gigabytes of space I gave Windows wasn't enough room for the files I was using, and I repeatedly filled the disk to near full (to the point where Explorer's disk usage meter turned red), before deleting files to temporarily gain some breathing room. As a result, the disk and MFT became heavily fragmented as Windows searched for places to store folder contents.

After I bought a 512GB SSD, I imaged my disk to the new drive (and likely repurposed the old one). Afterwards, IIRC I tried using KDE Partition Manager (also based off libparted) off a live Linux USB to expand the Windows partition. (I don't remember if the Windows partition was located before or after Linux, but if Windows was located after, I also moved it rightwards to allow expanding the Linux partition later.) I don't remember if the battery ran out or I shut off my laptop before the NTFS resize completed, or if (more likely?) the resize "succeeded" and I shut off the computer.

When I rebooted the computer, I couldn't mount the filesystem when booting into Linux, so I tried rebooting to Windows assuming it was a hibernation glitch or chkdsk issue. The extent of the problem dawned on me when Windows couldn't boot and failed to mount the C:\ boot drive. Unfortunately I damaged the disk further by running chkdsk or fsck on it without taking a backup first. I did not know how to recover the MFT the way Marcan did, but managed to recover some of my files using RecuperaBit (run in PyPy).

Honorable mention: ntfs2btrfs generates corrupted filesystems

I haven't done a full writeup on this one either, but I filed a bug report at https://github.com/maharmstone/ntfs2btrfs/pull/49 (link to logs). The corrupted btrfs filesystems go read-only upon access, with an error unable to find ref byte nr ... in __btrfs_free_extent called by __btrfs_run_delayed_refs.

I've since found a Stack Exchange thread on a similar crash in __btrfs_free_extent. The answer states concerningly:

A successful btrfs check --repair command doesn't necessarily yield a consistent btrfs filesystem.

I did not run btrfs scrub, and I'm guessing it's built to find data corruption, but also tests the entire filesystem tree's structure when looking for data (but cannot fix structural issues). I also found a GitHub btrfs issue reporting a similar error, but this time triggered by a power outage rather than buggy btrfs-generating tools.

I'm noticing a disturbing trend of filesystems reported clean by Linux's fsck-like tools, despite having corruption that causes failures during use or on Windows.