Replacing the SSD in my laptop

2024-02-29

A few months ago, when using my Thinkpad X220 laptop, I noticed that it hangs up for a moment when accessing a certain directory. Since the laptop is quite old (I bought it pre-owned in 2015) and I use it rather sparingly (I have a beefy tower PC that serves as my main machine), I wasn't very much bothered by it. Eventually, I decided to take a look into the issue.

As my knowledge of computer hardware is very limited, I turned to the internet in search of help. After reading through some forum posts, I decided to start out by asking the disk to perform a SMART self-test. The initial "short" test did not find any issues; while this was comforting, I'd rather err on the side of caution, so I went ahead and ran the "long" test. After some 40 minutes of waiting, the results were available.

SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) # 1 Extended offline Completed: read failure 90% 15553

Oh. Well, shit.

Preparing to clone the disk

So obviously, I needed a new disk. I went on the interwebz and ordered a Samsung 870 Evo series SSD; I use an 860 in my tower PC, so I knew it should be a cost-effective and pretty reliable choice. I also took the opportunity to get a larger disk (500GB vs. 250GB on the original one). After waiting a few days for my order to arrive, I retrieved it from the parcel locker... and then put it on a shelf, as it was Christmas time and I had other stuff on my head.

Fast forward a week or two and I finally had a free weekend. I consulted the internet for an explanation on how to remove the SSD from my ThinkPad, which turned out to be a matter of removing only a single screw (yay for repairability). After taking the disk out of the laptop, I went to my tower PC and removed the side cover to gain access to the SATA connectors.

The faulty SSD, removed from the laptop.

This is where I ran into a small hiccup, as it turned out I did not have any spare SATA data cables. Well, almost. After rummaging through every box-o'-cables I had at home, I found a single spare data cable. Still, the power supply in my PC had only two SATA power plugs, which meant that I had to remove the /home disk from the PC.

After connecting everything, I booted the PC and logged in as root. I then used blivet-gui to partition and format the new disk. Once that was done, it was time to clone the disk. I opened the terminal and started typing in the dd invocation. I decided to start by cloning the root partition, as - worst case scenario - I could always just perform a fresh install of the OS.

$ dd if=/dev/disk/by-id/… bs=1M status=progress of=/dev/disk/by-id/…

Remembering the old joke about how dd stands for "Disk Destroyer", I triple checked that I got the arguments right. (This was made a bit easier by the fact that the old disk still used LUKSv1 encryption.) Finally, I pressed Enter and anxiously watched the progress bar.

After a couple of minutes, dd finished. I tried mounting the copied partition and it worked perfectly. Since the new partition was a bit larger than the original, I also needed to resize the filesystem.

$ e2fsck -f /dev/disk/by-id/… e2fsck 1.47.0 (5-Feb-2023) Pass 1: Checking inodes, blocks, and sizes Inode 1182596 extent tree (at level 1) could be narrower. Optimize? yes Inode 1182663 extent tree (at level 1) could be narrower. Optimize? yes Inode 1182705 extent tree (at level 1) could be narrower. Optimize? yes Inode 1182712 extent tree (at level 1) could be narrower. Optimize? yes Inode 1183059 extent tree (at level 1) could be narrower. Optimize? yes Pass 1E: Optimizing extent trees Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/disk/by-id/…: ***** FILE SYSTEM WAS MODIFIED ***** /dev/disk/by-id/…: 559057/2621440 files (1.3% non-contiguous), 7838169/10485760 blocks $ resize2fs -p /dev/disk/by-id/… resize2fs 1.47.0 (5-Feb-2023) Resizing the filesystem on /dev/disk/by-id/… to 20967424 (4k) blocks. The filesystem on /dev/disk/by-id/… is now 20967424 (4k) blocks long.

This worked like a charm. Feeling confident, I moved on to cloning the /home partition.

$ dd if=/dev/disk/by-id/… bs=1M status=progress of=/dev/disk/by-id/… 530939904 bytes (531 MB, 506 MiB) copied, 22 s, 24.4 MB/s dd: error reading '/dev/disk/by-id/…': Input/output error 506+1 records in 506+1 records out 530939904 bytes (531 MB, 506 MiB) copied, 22.4828 s, 23.6 MB/s

Oh. Well, shit.

ddrescue to the rescue

So obviously, the disk was borked (big surprise). Not knowing how to proceed, I once again turned to the internet for help. Reading through a couple of forum posts and Stack Overflow questions, it seemed that my best choice was to use the aptly named ddrescue to try and rescue the data on the disk.

After skimming briefly through the man page, I typed the command in the terminal and launched it. The whole process took a couple of hours, but then, once it was done...

ipos: 199143 MB, non-trimmed: 0 B, current rate: 0 B/s opos: 199143 MB, non-scraped: 0 B, average rate: 5489 kB/s non-tried: 0 B, bad-sector: 133603 MB, error rate: 11689 kB/s rescued: 65539 MB, bad areas: 654, run time: 3h 18m 58s pct rescued: 32.91%, read errors:262983414, remaining time: n/a time since last successful read: 2h 38m 34s

Huh. So obviously, things were not good. I spent some more time rummaging through forums and Stack Overflow; fortunately, the consensus was that ddrescue can take a couple of tries to clone the disk fully and I should just try again. People also suggested making use of the --reverse option, which makes the program copy the disk starting from the end towards the beginning, and alternating between normal and reverse runs.

To get a better understanding of how the tool works, I consulted the manual. The whole process is split into a couple of phases, with the final phase (called "scraping") going over every sector individually and trying to copy it. This makes it the longest phase - it was responsible for 2½ hours of the 3⅓ hours total running time. As such, I decided to forego trimming on my next attempts. This proved to be a good decision, as watching kernel logs through dmesg later revealed that, after some time, the old disk would just "give up" - it'd stop responding for a minute, and then produce a 100% error rate until it was power-cycled.

After running ddrescue for a second time, the copy was still far from done, but the percentage of data recovered went up, so it seemed that not all hope was lost. Over the next few days, I spent some time restarting the rescue process again and again.

Run no | Direction | Data rescued | Percentage ------ | --------- | ------------ | ---------- 1 | Normal | 65 539 MiB | 32.91% 2 | Normal | 67 653 MiB | 33.97% 3 | Normal | 68 049 MiB | 34.17% 4 | Reverse | 79 674 MiB | 40.00% 5 | Reverse | 98 003 MiB | 49.21% 6 | Reverse | 143 390 MiB | 72.00% 7 | Reverse | 147 711 MiB | 74.17% 8 | Normal | 154 532 MiB | 77.59% 9 | Reverse | 156 994 MiB | 78.83% 10 | Normal | 168 291 MiB | 84.50% 11 | Normal | 170 299 MiB | 85.51% 12 | Reverse | 199 113 MiB | 99.98% 13 | Normal | 199 113 MiB | 99.98% 14 | Reverse | 199 113 MiB | 99.98% 15 | Normal | 199 113 MiB | 99.98%

Finally, I got to a point where there was no more progress. While 99.98% is not 100%, it was still a lot better than I had expected just a few days ago, so I was very happy with the result. Of course, missing 0.02% of filesystem blocks can mean anything - from having no effect at all (say, all those blocks were unused) to making the FS unusable (say, all those blocks were in the superblock area). There was only one way to find out - try and mount the filesystem.

$ mount --read-only /dev/disk/by-id/… /mnt

To my great relief, the filesystem mounted fine, and after spending a couple of minutes rummaging through the directories with cd and ls, I couldn't find any files that were obviously broken. All that was left to do now was to install the disk in the laptop and hope for the best.

Getting the thing to boot

Installing the disk back in the laptop was trivial. After performing all the necessary steps, I plugged the power connector, pressed the power button and anxiously waited for what comes next. After a couple of seconds, the screen told me: No boot device found. Sigh.

I took a Live USB I had laying around and used that to boot into a working system. A short look at the disk contents revealed the cause: as a result of all the shenanigans involved in copying the /home partition, I completely forgot to copy the /boot partition. Welp. Guess I have to remove the disk from the laptop and connect it to the PC again.

My professional disk cloning setup.

After copying the contents of the /boot partition, I dismantled the disk-cloning monstrosity for the last time and put the new disk back in the laptop. Once again, I plugged in the power connector, pressed the power button... And once again, the screen told me: No boot device found.

Feeling a little grubby

Confusion quickly gave way to embarassment, as I recalled the existence of this nifty little thing called the Master Boot Record. My X220 was configured to use the legacy BIOS boot method instead of UEFI; and while cloning the /boot partition transferred the GRUB config, kernel images and other important files, it didn't do anything to the MBR. Having taken a deep sigh of self-disappointment, I plugged the Live USB into the laptop once again.

$ grub2-install --target=i386-pc --boot-directory=/mnt /dev/sda Installing for i386-pc platform. Installation finished. No error reported.

Okay. With that out of the way, I should finally be able to boot, right?

Nope! This time, GRUB told me that it cannot find the root partition. Well, yeah. I forgot to update the disk UUIDs inside the configuration file. A quick trip into the Live USB and a little bit of vim trickery later, things were finally ready. After rebooting, I was greeted by the GRUB menu - never thought I'd be so happy to see it! I selected the kernel to boot and watched the messages fly on the screen. All that was left to do now was to enter the LUKS encryption password... and the kernel didn't ask for it. It just... sat there, doing nothing.

That's a bit of a pickle, in(n)it?

I was getting quite frustrated at this point, as I really had hoped to be long done by this point. After scratching my head for a bit, I tried taking a closer look at the kernel parameters.

root=UUID=… ro resume=/dev/mapper/luks-… rd.luks.uuid=luks-… rd.luks.uuid=luks-…

Clearly, these referenced disks by UUIDs, and yet the kernel didn't want to use them. Consulting an article listing the meaning of the rd. options revealed that, instead of listing every disk using rd.luks.uuid=…, one can also just use rd.auto=1 and let the kernel figure it out on its own. Tired and resigned, I rebooted the laptop to GRUB, pressed e to edit the commands before booting, made my changes, and hoped for the best.

Great success! The kernel's autoassembly of special devices like cryptoLUKS feature worked perfectly and I was greeted with a prompt for the decryption password. I input the passphrase and anxiously awaited as the system... did nothing. After a bit over a minute, I saw an error message informing that it could not complete the cryptography setup. I stared at the screen for a few seconds, until I noticed - those UUIDs seem wrong. Aren't they the old ones?

Ah, right. Of course. I forgot to edit /etc/fstab and /etc/crypttab. Time for another trip into the Live USB. Now everything should work! Except it didn't - I got the exact same error as before. Nothing changed.

Couple more angry sighs of dissappointment later, I realised - but of course! Editing those files could not change anything, as they lived on the encrypted partition. That's a bit like sticking the code for a safe inside the safe. But then, where does the kernel take the fstab and crypttab from? I had a very vague recollection of the idea that an initramfs is involved in the boot process. Could this be it?

After skimming through a few articles, my suspicions were confirmed. During the early boot phase, the kernel uses an initial file system taken from the initramfs. Among the contents of the initramfs are the fstab and crypttab files. Which meant that, while updating them definitely was a good idea, I also needed to edit the copies sitting inside the initramfs I was using. Or, alternatively, generate a new one with updated contents.

It was time for one final trip to the Live USB. No promises on the "final" part, though.

$ mount /dev/mapper/luks-… /mnt $ mount /dev/sda1 /mnt/boot $ mount -t devtmpfs devtmpfs /mnt/dev $ mount -t proc proc /mnt/proc $ mount -t sysfs sysfs /mnt/sys $ chroot /mnt $ cd /boot $ dracut --fstab . 6.5.10-100.fc37.x86_64

This generated a new initramfs at /boot/initramfs.img. Now all that was left to do was to reboot the machine, and, before booting, edit the commands in GRUB.

Okay, let's go. I pointed the kernel at the new initramfs and used rd.auto for good measure. The decryption prompt came up and I entered the password. Nothing happened. I felt my heart go up to my throat... and then the system displayed a message informing that LUKS setup was complete. Huh. Guess I might have cranked the crypto settings a bit too high.

After a few more seconds, I was greeted by the graphical login screen. I entered my username and password and finally, finally arrived at my desktop. Just as I last saw it over a week ago. ...wait, this took a whole week?

The aftermath

Once inside the system, I took a while to test things out and ensure everything works correctly. After that, I regenerated the initramfs one more time (just for good measure) and regenerated the GRUB configuration - which led me to discover that I overlooked editing /etc/default/grub, as the disk UUIDs passed to the kernel are not auto-detected, but rather taken from said file. With these two changes, I was now able to boot the system without having to manually edit the GRUB commands each time.

I then proceeded to upgrade the OS from Fedora 37 (which was EOL for almost two months at this point) to the latest release of Fedora 39. Once that was done, all that was left to do was to resize the /home filesystem. As expected, e2fsck complained a lot, but it was able to fix up the missing bits.

An artistic rendition of the "in and out, 20-minutes adventure" meme from Rick and Morty.

And with that, my (mis)adventure was finally over.

tl;dr, or retracing my steps for future reference

I like to joke that half of my blog's audience is just me, re-reading my own articles to refresh my memory - so here's a quick to-do list, in case I ever need to repeat this.

  1. Hook up the old disk and the new disk to a single machine.

  2. If the disks are of the same capacity, you can use dd to clone the entire disk. Otherwise, you'll have to partition the new disk and copy each partition separately.

  3. If things go awry with dd, pray to your favourite deity and try using ddrescue.

  4. ddrescue can take a few attempts to rescue all the data. The --try-again and --retrim options are your friends. You should also try alternating between normal operation and using the --reverse option.

  5. Remember to clone all the partitions. (duh)

  6. If you're using legacy BIOS for booting, and you cloned individual partitions, you'll need to install the bootloader in the new disk's MBR.

  7. Go through /boot/grub2/grub.cfg, /etc/default/grub, /etc/fstab and /etc/crypttab and update the UUIDs of the disks.

  8. Regenerate the initramfs.

There might be some extra steps involved when using UEFI for booting, or if you ditch GRUB for systemd-boot or some other bootloader. Should you know about those - feel free to contact me and I'll update the list above.

References

Share with friends

e-mail Google+ Hacker News LinkedIn Reddit Tumblr VKontakte Wykop Xing

Comments

Do you have some interesting thoughts to share? You can comment by sending an e-mail to blog-comments@svgames.pl.