I have a Hetzner server (the server that hosts this blog, as it happens) which runs Debian. While waiting to give a presentation at the day job over Zoom, I decided this was a good moment to upgrade the system from Debian Bullseye (now EOL) to Bookworm.

This is something I’ve done before, so I didn’t anticipate any problems.

I was wrong.

Anyway, after performing the upgrade in the usual way, I rebooted and was greeted with… well… nothing.

The server failed to restart.

I logged in to Hetzner Robot and booted the rescue image, but I couldn’t find anything of note in the server logs. Indeed, the logs appeared to be completely untouched. To me, this pointed to a problem with the boot loader.

Thankfully, Hetzner supports a vKVM, so I booted into that. Again, nothing.

Hmm…

On playing around, however, I did notice that if you boot the vKVM (which automatically starts the rescue image) and then trigger a soft reset from within the KVM itself, the KVM remains attached during the boot process. This allowed me to see… a grub error.

Joy. But at least I had identified the fault.

The error in question was complaining about not being able to find normal.mod, which is fairly critical. I poked around the recovery console, mounting the various filesystems in the RAID, but couldn’t find the file. So I attempted to load the kernel manually using insmod… to be treated to another error complaining about linux.mod not being found.

So, grub was completely b0rked.

This, however, gave us the answer…

The fix

The problem is that, for whatever reason, grub (the bootloader) has got messed up, so we need to rebuild it. Since we can’t boot the system, we need to do this from the rescue system. Hetzner does have an installimage tool, but I felt a little wary about running that, since my understanding was that it would wipe everything… a bit of a nuclear option.

Thankfully, there was a lower impact solution we could try first.

  1. Confirm your software raid is working by taking a look at /proc/mdstat, and listing the structure using lsblk. Your raid should already be assembled, but if it isn’t, you can run mdadm --assemble --scan
  2. Next, find your mount points, for me:
    • md0 = swap (ignore)
    • md1 = /boot
    • md2 = /
    • md3 = /home (your home directories, so leave this alone)
  3. Now you’re ready to set up a chroot into your installed system.
    • Mount your root drive (md2) to /mnt: mount /dev/md2 /mnt
    • Mount your boot drive (md1) inside it: mount /dev/md1 /mnt/boot
    • Bind various system drives
      • mount --bind /dev /mnt/dev
      • mount --bind /proc /mnt/proc
      • mount --bind /sys /mnt/sys
    • Finally, create your chroot: chroot /mnt
  4. Now, rebuild and reinstall grub
    • grub-install /dev/sda (I guessed this was where my boot loader lived, which is usually the case)
    • update-grub
  5. Exit, unmount, and reboot
    • exit (to leave the chroot)
    • umount /mnt/dev
    • umount /mnt/proc
    • umount /mnt/sys
    • umount /mnt/boot
    • umount /mnt
    • reboot
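
The steps above can be sketched as a small script. This is a minimal sketch rather than a tested recovery tool: the device names (/dev/md1, /dev/md2, /dev/sda) are simply the ones from my layout, and the DRY_RUN switch and the repair_grub function name are my own inventions for illustration. By default it only prints the commands it would run, so you can check them against your own lsblk output first.

```shell
#!/bin/sh
# Sketch of the rescue-system repair described above.
# Device names are assumptions taken from my layout; override as needed.
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print commands (default)
ROOT_DEV="${ROOT_DEV:-/dev/md2}"   # root filesystem
BOOT_DEV="${BOOT_DEV:-/dev/md1}"   # /boot
GRUB_DISK="${GRUB_DISK:-/dev/sda}" # disk that receives the boot loader
TARGET="${TARGET:-/mnt}"           # where the installed system is mounted

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repair_grub() {
    # Mount the installed system and the pseudo-filesystems grub needs.
    run mount "$ROOT_DEV" "$TARGET" || return 1
    run mount "$BOOT_DEV" "$TARGET/boot" || return 1
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$TARGET/$fs" || return 1
    done

    # Reinstall grub and regenerate its config from inside the chroot.
    run chroot "$TARGET" grub-install "$GRUB_DISK" || return 1
    run chroot "$TARGET" update-grub || return 1

    # Unmount everything again, innermost mounts first.
    for fs in sys proc dev boot; do
        run umount "$TARGET/$fs" || return 1
    done
    run umount "$TARGET"
}
```

Source the file in the rescue system and run repair_grub to preview the commands, then DRY_RUN=0 repair_grub once you are happy with them. Rebooting is deliberately left as a manual step.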

All being well, your server should be back up and running. For me, however, this wasn’t quite the end of the story.

After rebooting, my server was still inaccessible. I repeated the vKVM trick, fully expecting to see a grub error, but the server was booting normally.

Using the root password, I logged in to the console and, sure enough, my server was running, but there was no network connectivity.

A bit of poking around showed that, for some reason, the network interface name had changed, and /etc/network/interfaces was hard coded to use the old one.

I used ip link show to find the correct network interface name, updated /etc/network/interfaces to match, and restarted.
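
For reference, here is a rough sketch of that last step. The helper names (list_ifaces, rename_iface) and the interface names in the usage note are hypothetical; your old and new names will differ. The rename treats the old name as a regular expression, so a name containing a dot would need escaping.

```shell
#!/bin/sh
# List interface names as the kernel currently sees them.
list_ifaces() {
    ip -o link show | awk -F': ' '{print $2}'
}

# Replace whole-word occurrences of OLD with NEW in FILE.
# The original file is kept alongside as FILE.bak.
rename_iface() {
    old="$1"; new="$2"; file="$3"
    cp "$file" "$file.bak" || return 1
    sed -E "s/(^|[[:space:]])${old}([[:space:]]|\$)/\1${new}\2/g" \
        "$file.bak" > "$file"
}
```

Something like rename_iface eth0 enp0s31f6 /etc/network/interfaces (both names made up here) would then update the config, after which you can restart networking or simply reboot.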

Boom, server back up… and now I can tell you about it here!

Hope this is of use to someone.

Writing this here, since it caused some issues at the Day Job, and took a little bit of work to debug. Hopefully this will save some of you some time, and will jog my memory should something like this happen again.

Anyway, last week, one of my team wanted to deploy a new release of our Access Management System (ARIA), which involved deploying a bunch of containers. The release procedure worked fine, but after the deploy the main web app container was entirely unable to talk to our API layer.

My team did a little bit of debugging, but it was at this point that it got escalated to me. I rolled the live environment back, and began debugging the problem.

On the face of it, it seemed that networking, or at least name resolution, was no longer working from within the container. A curl call from the command line produced:

curl: (6) getaddrinfo() thread failed to start

However, a connection to an IP address would work. So, I began looking at networking / name resolution. The next step was to see what the name servers were doing… however, nslookup gave me:

isc_thread_create(): fatal error: pthread_create(): Operation not permitted

Interesting… so something was blocking the creation of new threads within the container. Most likely this was the security profile Docker was running the container under… not sure why this would change, but I confirmed it by redeploying with seccomp (and AppArmor) turned off:

security_opt:
  - "apparmor:unconfined"
  - "seccomp:unconfined"

Confirmed, networking was working.

Not sure exactly what’s changed, but it would appear that somewhere down the line the base Apache Linux image has updated, and is now using a different system call for starting threads. Likely a new version of glibc has been rolled into the container somewhere. This would be consistent with newer glibc releases (2.34 onwards) starting threads via the clone3 system call, which older Docker seccomp profiles reject.

The final fix was to update the various containers to make sure they were all running on the newer base image, and then to upgrade Docker across our estate so that it was running the latest version.
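
If you want to check a host before deploying, something like the following sketch might help. The version_ge and check_docker helpers are my own, and the 20.10.10 threshold is an assumption on my part (I believe that is the release where the default seccomp profile started allowing the newer thread-creation call, but check the Docker release notes rather than trusting my memory). It relies on sort -V being available.

```shell
#!/bin/sh
# True when version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort) from coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Assumed minimum Docker Engine version; verify against the release notes.
MIN_DOCKER="${MIN_DOCKER:-20.10.10}"

# Compare the running Docker daemon against MIN_DOCKER.
check_docker() {
    current="$(docker version --format '{{.Server.Version}}')" || return 2
    if version_ge "$current" "$MIN_DOCKER"; then
        echo "ok: docker $current"
    else
        echo "too old: docker $current (< $MIN_DOCKER)"
        return 1
    fi
}
```

Running check_docker on each host gives a quick yes/no before you roll out containers built on a newer base image.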

Boom.

Everything back to normal, hope this saves you some time!

Molgenis is an open-source platform for scientific data management and research. The name “Molgenis” is derived from “Molecular Genetics Information Systems.” It provides tools for researchers to design, capture, and share data in the field of molecular genetics and other related areas.

Molgenis is designed to facilitate the handling of large-scale, complex datasets in genomics and other biomedical research domains. It offers features such as data integration, data modeling, and data management. Researchers can use Molgenis to create databases, design forms for data entry, and perform data analysis.

At The Day Job, we’re using it as part of our oncology research infrastructure project to act as a source of truth for certain system information as we build out a distributed access platform to help scientists and doctors conduct their research.

Anyway, at the time of writing, there wasn’t a PHP client library for it, so I quickly put one together. Have fun!

» Visit the project on GitLab...