Monitoring PSUs in Arch Linux Dell Servers

2009 August 22
by Joe

We currently use Arch Linux exclusively for servers.  Much of our equipment comes from Dell, and one of the gotchas of using a non-Red Hat, non-SUSE Linux distribution with their servers is that you cannot just drop in their OpenManage tools to monitor everything.  As a side note, despite the bloat of OpenManage, it actually isn’t a bad set of tools once you get it installed (on an rpm-based distribution); the command line utilities you get with it are pretty decent.  The GUI stuff is largely worthless in my biased opinion.

In any case, OpenManage is a big pile of rpms with lots of dependencies, so it wouldn’t be easy to transfer to Arch.  I posted a while back about using LSI’s MegaRAID CLI tool to monitor Dell RAID arrays, but what about everything else?  One big item that was really haunting me was power supplies.  I had no monitoring on them, so if I went too long between datacenter visits to check for amber lights on the fronts of the servers, we could have a double PSU failure on a really important server and be in a lot of trouble.  Server PSUs fail A LOT, so don’t discount the importance of monitoring them.  My personal experience might be unusual, but I’ve had more PSUs fail than hard drives.

Yesterday I read up on the Intelligent Platform Management Interface (IPMI) and saw that Dell has supported it for a while. I suspect their OpenManage tools are simply a proprietary wrapper around this.

So you just need some other software that will let you access this stuff. Some quick searches turned up ipmitool and freeipmi as the popular options.  I went with freeipmi because it is a very active project with recent updates (the Arch Linux AUR had them both, but both PKGBUILDs were broken).

Installing it is straightforward:  download the tarball, unpack, ./configure, make, make install.  This will place a bunch of command line tools in /usr/local/sbin/.  You’ll probably want to script the install process if you have to set up lots of servers.  Or, don’t be lazy like me: bundle up a PKGBUILD for the Arch AUR and just use yaourt for the installs.

You can check out the README and man pages for information about all the various commands but I am just using one: ipmi-sel.  This prints out the contents of the “System Event Log” and seems to correspond very nicely with any messages that appear on the front LCD of the server.  I removed, replaced, and removed again a PSU in a server and saw this perfectly parseable and useful output:

28:21-Aug-2009 09:00:36:Power Supply Status:Presence detected
29:21-Aug-2009 09:00:37:Power Supply PS Redundancy:Redundancy Lost
30:21-Aug-2009 09:04:56:Power Supply Status:Presence detected
31:21-Aug-2009 09:04:57:Power Supply PS Redundancy:Fully Redundant (formerly "Redundancy Regained")
32:21-Aug-2009 09:05:16:Power Supply Status:Presence detected
33:21-Aug-2009 09:05:17:Power Supply PS Redundancy:Redundancy Lost
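Since every record above is colon-delimited, awk can pull it apart. The sketch below assumes exactly the layout shown here (record ID, date and time, sensor, event); other freeipmi versions may format the log differently, and sel_fields is just a name made up for this example, not something freeipmi ships.

```shell
# sel_fields: hypothetical helper, not part of freeipmi.
# Reads SEL lines on stdin in the layout shown above and prints
# tab-separated columns: record ID, timestamp, sensor, event.
sel_fields() {
    # Splitting on ":" also cuts the HH:MM:SS time into three fields,
    # so fields 2-4 are glued back together into one timestamp.
    # Lines with fewer than 6 fields are skipped.
    awk -F: 'NF >= 6 { printf "%s\t%s:%s:%s\t%s\t%s\n", $1, $2, $3, $4, $5, $6 }'
}
```

You would feed it the live log with something like /usr/local/sbin/ipmi-sel | sel_fields and then cut or grep on the column you care about. An event description that itself contains a colon would be truncated at that colon, so treat this as a starting point only.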

Other events I have seen in this log are memory DIMM failures and the case being open, and those are conveniently the only other alerts I’ve seen on the LCDs before. You can clear out the SEL and have a few other options just with that one ipmi-sel tool; read the man page for more information.

For monitoring PSU failure I am using this script. I am sure there are tighter approaches but this one works fine:

/usr/local/sbin/ipmi-sel | grep "PS Redundancy:" | tail -n1 | grep "Redundancy Lost" | wc -l

That will print 1 if the last “PS Redundancy” item in the log is a failure, and 0 otherwise (including when the log has no such item at all). You can then easily snap that into Zabbix or whatever monitoring software you prefer. I did a post a while back with more detail on adding items to Zabbix that might be helpful if you are not familiar with the process.
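For what it's worth, one tighter approach might be a single awk pass wrapped in a shell function. This is only a sketch: it assumes the same log format as the sample output above, and psu_redundancy_lost is a made-up name. It reads the log from stdin, so you can also try it against a saved copy of the SEL.

```shell
# psu_redundancy_lost: hypothetical helper, not part of freeipmi.
# Reads SEL lines on stdin and prints 1 if the most recent
# "PS Redundancy" event is "Redundancy Lost", otherwise 0
# (also 0 when the log has no "PS Redundancy" event at all).
psu_redundancy_lost() {
    awk '
        /PS Redundancy:/ { last = $0 }   # remember the latest matching record
        END {
            if (last ~ /Redundancy Lost/) print 1
            else print 0
        }
    '
}
```

Usage would be /usr/local/sbin/ipmi-sel | psu_redundancy_lost, which gives the same 1 or 0 as the pipeline above and can be plugged into Zabbix the same way.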
