Monitoring PSUs in Arch Linux Dell Servers
We currently use Arch Linux exclusively for servers. Much of our equipment comes from Dell, and one of the gotchas of using a non-Red Hat, non-SUSE Linux distribution with their servers is that you cannot just drop in their Open Manage tools to monitor everything. As a side note, despite the bloat of Open Manage, it actually isn’t a bad set of tools once you get it installed (on an RPM-based distribution); the command line utilities you get with it are pretty decent. The GUI stuff is largely worthless in my biased opinion.
In any case, Open Manage is a big pile of RPMs with lots of dependencies, so it wouldn’t be easy to transfer to Arch. I posted a while back about using LSI’s MegaRAID CLI tool to monitor Dell RAID arrays, but what about everything else? One big item that was really haunting me was power supplies. I had no monitoring on those things, so if I went too long between datacenter visits to check for amber lights on the fronts of the servers, we could have a double PSU failure on a really important server and be in a lot of trouble. Server PSUs fail A LOT, so don’t discount the importance of monitoring them. My personal experience might be unusual, but I’ve had more PSUs fail than hard drives.
Yesterday I read up on the Intelligent Platform Management Interface (IPMI) and saw that Dell has supported it for a while. I suspect their Open Manage tools are simply a proprietary wrapper around this.
So you just need some other software that will let you access the IPMI data. Some quick searches turned up ipmitool and FreeIPMI as the popular options. I went with FreeIPMI because it is a very active project with recent updates (the Arch Linux AUR had both, but both PKGBUILDs were broken).
Installing it is straightforward: download the tarball, unpack it, then run ./configure, make, and make install. This will place a bunch of command line tools in /usr/local/sbin/. You’ll probably want to script the install process if you have to set up lots of servers. Or, unlike lazy me, bundle up a working PKGBUILD for the Arch AUR and just use yaourt for the installs.
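If you do want to script those steps, here is a rough sketch of how it might look. The function names (run, install_freeipmi) and the DRY_RUN switch are my own inventions for illustration, and it assumes the tarball is already downloaded and unpacks into a directory named after the file, which you should check for the release you grab.

```shell
#!/bin/sh
# Sketch: build and install FreeIPMI from an already-downloaded source tarball.
# Set DRY_RUN=1 to print the commands instead of running them.

# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# install_freeipmi TARBALL
# Assumption: freeipmi-X.Y.Z.tar.gz unpacks into ./freeipmi-X.Y.Z
install_freeipmi() {
    tarball=$1
    srcdir=$(basename "$tarball" .tar.gz)

    run tar -xzf "$tarball" || return 1

    # Build in a subshell so the caller's working directory is left alone.
    (
        run cd "$srcdir" &&
        run ./configure &&
        run make &&
        run make install
    )
}
```

Calling install_freeipmi with the path to the tarball runs the whole sequence and stops at the first step that fails. The make install step normally needs root, and with the default prefix the tools end up under /usr/local as described above.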
You can check out the README and man pages for information about all the various commands but I am just using one: ipmi-sel. This prints out the contents of the “System Event Log” and seems to correspond very nicely with any messages that appear on the front LCD of the server. I removed, replaced, and removed again a PSU in a server and saw this perfectly parseable and useful output:
28:21-Aug-2009 09:00:36:Power Supply Status:Presence detected
29:21-Aug-2009 09:00:37:Power Supply PS Redundancy:Redundancy Lost
30:21-Aug-2009 09:04:56:Power Supply Status:Presence detected
31:21-Aug-2009 09:04:57:Power Supply PS Redundancy:Fully Redundant (formerly "Redundancy Regained")
32:21-Aug-2009 09:05:16:Power Supply Status:Presence detected
33:21-Aug-2009 09:05:17:Power Supply PS Redundancy:Redundancy Lost
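To show what I mean by parseable, here is a small sketch that splits each entry into its parts. It assumes one entry per line in the layout id:date time:sensor:event, which is only what I see in the sample above; other FreeIPMI versions may format things differently. The function name sel_fields is made up for this example.

```shell
#!/bin/sh
# sel_fields: read ipmi-sel style lines on stdin and print
# "id|timestamp|sensor|event" for each entry.
# Assumed layout: id:DD-Mon-YYYY HH:MM:SS:sensor:event
sel_fields() {
    awk -F: 'NF >= 6 {
        # Fields 2-4 are the timestamp, split apart by the colons in HH:MM:SS.
        ts = $2 ":" $3 ":" $4
        # Rejoin any remaining fields in case the event text contains a colon.
        event = $6
        for (i = 7; i <= NF; i++) event = event ":" $i
        print $1 "|" ts "|" $5 "|" event
    }'
}
```

You would feed it with something like /usr/local/sbin/ipmi-sel | sel_fields, and from there it is easy to filter on the sensor or event column.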
Other events I have seen in this log are memory DIMM failures and the case being open, and those are conveniently the only other alerts I’ve seen on the LCDs before. You can clear out the SEL and have a few other options just with that one ipmi-sel tool; read the man page for more information.
For monitoring PSU failure I am using this script. I am sure there are tighter approaches but this one works fine:
/usr/local/sbin/ipmi-sel | grep "PS Redundancy:" | tail -n1 | grep "Redundancy Lost" | wc -l
That will return a 1 if the last “PS Redundancy” related item in the log is a failure, and 0 otherwise. You can then easily snap that into Zabbix or whatever monitoring software you prefer. I did a post with more detail on adding items to Zabbix a while back that might be helpful if you are not familiar with the process.
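For what it’s worth, one of those tighter approaches might look like the sketch below, which does the same check in a single awk pass. The function name psu_redundancy_lost is my own, and it assumes the same log wording as the sample output earlier in this post.

```shell
#!/bin/sh
# psu_redundancy_lost: read ipmi-sel output on stdin and print 1 if the most
# recent "PS Redundancy" entry reports "Redundancy Lost", otherwise 0.
# An empty log, or one with no redundancy entries, also prints 0.
psu_redundancy_lost() {
    awk '
        # Remember the newest entry that mentions PS Redundancy.
        /PS Redundancy:/ { last = $0 }
        END {
            lost = (last ~ /Redundancy Lost/) ? 1 : 0
            print lost
        }
    '
}
```

Usage would be /usr/local/sbin/ipmi-sel | psu_redundancy_lost. In Zabbix, a command like this is typically hooked in through a UserParameter line in the agent config, with whatever key name you choose.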