OpenSuSE No More

2008 August 1
by Joe

I’ve installed OpenSuSE on a dozen or so work servers, used it as my previous development environment for about a year, and generally have been a big fan.

However, the ‘official’ repositories seem to get more and more out of date (I am mostly running 10.2, and my impression is that it has been left to rot), and I’ve grown increasingly frustrated with how slow YaST has become at updating its caches of RPMs and repositories whenever I want to install or update software. I generally load YaST, wait 10 to 15 minutes, then do what I was looking to do. I have the machines set to update themselves every week; do I need to toggle another setting to make them go ahead and update their software lists and repository caches?

That hasn’t been too big of a deal. The machines had been rock-solid stable (300+ days of uptime), so I didn’t want to fiddle with something that was working. I had a weird experience this week, though, when I rebooted some of the long-stable machines and eventually realized what was happening.

The first case happened at the office, on a machine that wasn’t very important. I rebooted, and when the machine came back up there were dozens of errors: runlevel 3 applications could not start because my /var partition wasn’t accessible, networking didn’t start correctly, and the keyboard did not work. For this machine I just blamed the hard drive and requested a replacement from Dell.

Then I went to the datacenter and rebooted a production application server to troubleshoot an amber light and the exact same thing happened. I did not expect that and could not write off that machine as it served several important roles.

I booted up with a live CD and all of the system partitions were fine. Everything could be mounted, fsck came back clean, and I could chroot into the SuSE system and stuff worked. I checked over my /boot partition, GRUB configuration and inittab file, and I had no idea why it would fail so utterly at boot time.

Our basic (non-database) server is set up like this:

/boot - primary Linux ext3
swap - primary Linux swap
/ - primary LVM
/dev/system/root as '/' on the LVM partition as ext3
/dev/system/var as '/var' on the LVM partition as ext3

These machines were updating every week, but kernel updates do not take effect until a reboot, so my gut feeling was that something related to LVM had changed with the kernel upgrade. I spent literally 6 hours troubleshooting down this path and was inclined to believe this was the issue, because there is in fact a lot of chatter on Google about kernel upgrades screwing with LVM. I even tried creating a non-LVM /var, copying the contents of the LVM /var there and booting. That almost worked, but the network did not start and I still could not use the keyboard.

At about hour 5 I pulled up the Novell documentation for the OpenSuSE init process and started working through it step by step, chroot’ed in from the live CD.

What was the issue?

OpenSuSE deleted its own /etc/init.d/boot script.

Seems impossible, right? I’ve watched it happen 3 times now and still have a lot of machines to reboot that I fully expect to have the same problem. Perhaps a penalty for long uptime? I missed it completely when I was checking over inittab initially. I guess I just assumed the core script that kicks off EVERYTHING would not have been deleted by the official update/upgrade process of a mature Linux distribution. I managed to find it by progressively stepping back the initial runlevel passed to the kernel by GRUB until I could see far enough up the boot process to catch the ‘file not found’ message. I didn’t see anyone else having this issue, so I am hoping that if someone else hits it they find this post instead of wasting half a day chasing false causes.
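For anyone poking at a broken install from a live CD, a check along these lines would flag a missing init script much faster than I did. This is only a rough sketch, not a SuSE tool: the function name missing_inittab_targets and the mount point /mnt/suse are made up for illustration, and it assumes inittab entries in the usual sysvinit id:runlevels:action:process form. It lists every absolute path named in inittab that does not exist under the mounted root.

```python
import os
import shlex


def missing_inittab_targets(root="/"):
    """Return absolute paths named in <root>/etc/inittab that are missing under root.

    Hypothetical helper for illustration; assumes sysvinit-style entries
    of the form id:runlevels:action:process.
    """
    inittab = os.path.join(root, "etc/inittab")
    missing = []
    with open(inittab) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            fields = line.split(":", 3)
            if len(fields) < 4:
                continue  # not an id:runlevels:action:process entry
            # A leading '+' is a sysvinit flag, not part of the path.
            process = fields[3].strip().lstrip("+")
            try:
                command = shlex.split(process)[0]
            except (ValueError, IndexError):
                continue  # empty process field (e.g. initdefault) or odd quoting
            if not command.startswith("/"):
                continue  # only absolute paths can be checked reliably
            target = os.path.join(root, command.lstrip("/"))
            # lexists counts a symlink as present even if it points outside root.
            if not os.path.lexists(target) and command not in missing:
                missing.append(command)
    return missing


if __name__ == "__main__":
    # /mnt/suse is an assumed mount point for the installed system's root.
    for path in missing_inittab_targets("/mnt/suse"):
        print("missing:", path)
```

It only looks at the scripts inittab names directly, so it would not notice a missing script further down the chain, but in my case the very first one was the one that had vanished.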

So now the question is what distribution should go on our servers (a distribution that neuters itself during an upgrade cannot stay). I am pretty fond of Ubuntu server but Sean at the office has pointed me at Arch and I am really, really digging the way they do things. I hope to make a separate post about that at the company blog in the near future.