Monitoring Dell Perc5 and Perc6 Disks in Arch Linux

2009 March 11
by Joe

One of the downsides to hardware raid is that it is not as easy to monitor as software raid. To the OS your big raid array is just a single big disk: the hardware controller masks the individual disks and their status, so monitoring them requires proprietary software made to match the hardware. This is the position you will be in if you buy Dell equipment with their Perc5/Perc6 controllers. You can still monitor the Dell disks; you just need the matching software. This will probably work for any Linux distribution, and I suspect for the earlier Perc controllers as well.

It is definitely a good idea to monitor your individual disks, else you could have one fail and not even know your raid array is operating in a degraded state. When a disk fails you want to act quickly (and preferably have hot spares configured in your controller) because disks manufactured in the same batches are rumored to frequently fail around the same time. Though I have not experienced this personally, it seems plausible.

The first option you might happen upon is installing Dell’s OpenManage software. That stuff is pretty bloated and you have an adventure ahead of you getting it to install if you aren’t running Redhat or SuSE (or perhaps Debian, for which a repackaged set of .debs seems to pop up quickly after a new version is released).

The other option, and the one I will shoot through here, is using the Megaraid CLI tool from LSI. The Perc controllers in the Dells are apparently rebranded LSI controllers, so you can use this command line tool to extract the good information for monitoring purposes. Here are the steps to install it.

Installation

  1. Grab the Megaraid CLI program from LSI. At the time of this post you can get it at http://www.lsi.com/DistributionSystem/AssetDocument/4.00.11_Linux_MegaCLI.zip.
  2. Put the downloaded .zip file on your server somewhere in a temporary location and unzip it.
  3. Unzip the zip file that is unpacked (guess the double zip is for good luck?)
  4. Install rpmextract. That is pacman -Sy rpmextract on Arch Linux.
  5. Unpack the .rpm that was in the innermost .zip file with /usr/bin/rpmextract.sh MegaCLI-4.0.0.11-1.i386.rpm
  6. mv the resulting MegaRAID directory to /opt/MegaRAID.

Note that if you are running one of those fancy rpm-based Linux distributions you will obviously skip the rpmextract part :)
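
The steps above can be sketched as one shell function. This is only a rough outline, not a tested installer: the function name install_megacli is made up for illustration, the layout inside the archive is an assumption (so the inner .zip, the .rpm and the MegaRAID directory are located by pattern rather than by exact path), and it assumes unzip and rpmextract.sh are already on your PATH.

```shell
# Sketch of installation steps 2-6 as one function (hypothetical helper name).
# Assumption: the archive layout is unknown, so files are found by pattern.
install_megacli() {
    zip_file=$1                      # the .zip downloaded in step 1
    dest=${2:-/opt/MegaRAID}         # final location from step 6

    [ -f "$zip_file" ] || { echo "no such file: $zip_file" >&2; return 1; }
    [ ! -e "$dest" ]   || { echo "$dest already exists" >&2; return 1; }

    work=$(mktemp -d) || return 1

    # Steps 2 and 3: unpack the outer zip, then any zip found inside it.
    unzip -q "$zip_file" -d "$work" || return 1
    find "$work" -name '*.zip' -exec unzip -q -o {} -d "$work" \; || return 1

    # Step 5: extract the rpm payload into the work directory.
    rpm_file=$(find "$work" -name 'MegaCLI-*.rpm' | head -n 1)
    [ -n "$rpm_file" ] || { echo "no MegaCLI rpm found in archive" >&2; return 1; }
    (cd "$work" && rpmextract.sh "$rpm_file") || return 1

    # Step 6: move the extracted MegaRAID directory into place.
    src=$(find "$work" -type d -name MegaRAID | head -n 1)
    [ -n "$src" ] || { echo "no MegaRAID directory after extraction" >&2; return 1; }
    mv "$src" "$dest"
}
```

You would call it as root with something like install_megacli 4.00.11_Linux_MegaCLI.zip, after installing rpmextract. It refuses to run if the destination already exists, so it will not clobber an earlier install.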

That will do it for the install. Now on to Usage.

Usage

You can run /opt/MegaRAID/MegaCli/MegaCli64 -h to see all the available options. There are a ton of them and the help output is not much use. This tool appears to be able to both query the status of disks and perform operations against them. I’ve only used it for monitoring and really only use this one command:

/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll

This will generate a ton of output on each of the compatible controllers in your server. If you have just a server (that isn’t attached to extra storage) you will likely only have 1 controller. If you have an MD1000/3000 or other direct attached storage connected via an extra Perc adapter you could have multiple controllers.

In all of that mess of output there will be a “Device Present” section for each controller. In that section you will see output like this:

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 16
  Disks           : 15
  Critical Disks  : 0
  Failed Disks    : 0

You can see 1 virtual drive listed – this is the big single drive the OS sees. You can also see its status there – Degraded is thankfully 0. If it were greater than 0 it would mean your “Virtual Drive” is degraded, most likely because a disk has dropped out. I suppose Offline would mean your array is fried due to multiple disk failures or isn’t being used at all.

Additionally you can see the physical drives and their status. Thankfully none of mine are Critical or Failed. I can’t say I understand why it says there are 16 drives when there are only 15 unless it is recounting the virtual drive for some reason.
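
If you want all four health counters at once rather than eyeballing the output, something like the following might do. It is a minimal sketch: the function name summarize_device_present is made up, and it assumes every “Device Present” section looks like the sample above and ends with the “Failed Disks” line. It reads the MegaCli output on stdin and sums each counter across all controllers.

```shell
# Hypothetical helper: sum the health counters from every "Device Present"
# section found on stdin. Assumes the section ends at the "Failed Disks" line.
summarize_device_present() {
    awk -F: '
        /Device Present/                { in_section = 1; next }
        !in_section                     { next }
        /^[ \t]*Degraded[ \t]*:/        { degraded += $2 }
        /^[ \t]*Offline[ \t]*:/         { offline  += $2 }
        /^[ \t]*Critical Disks[ \t]*:/  { critical += $2 }
        /^[ \t]*Failed Disks[ \t]*:/    { failed   += $2; in_section = 0 }
        END {
            printf "degraded=%d offline=%d critical=%d failed=%d\n",
                   degraded, offline, critical, failed
        }
    '
}
```

Usage would be along the lines of /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll | summarize_device_present. Anything other than all zeros deserves a closer look.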

In any case, you can see that by checking those counts you can pick up whether your raid array is operating in a degraded state. Here is a command that grabs just the “Degraded” number from the “Virtual Drives” section:

/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -a0 | grep -A 1 "Virtual Drives" | awk 'END {print $3}'

Even better, here is one that sums up the “Degraded” number for all controllers. This one is more flexible as it is equally useful on a server with a bunch of different controllers and raid arrays:

/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll -NoLog | grep -A 2 "Virtual Drives" | awk '/Degraded/ {TOTAL += $3} END {print TOTAL}'

The output of that last command is a single integer. That would be easy to snap into your monitoring software of choice – my preference is zabbix so I would add this to my /etc/zabbix/zabbix_agentd.conf file:

UserParameter=custom.megaraid.degradedCount,/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll -NoLog | grep -A 2 'Virtual Drives' | awk '/Degraded/ {TOTAL += $3} END {print TOTAL}'

Now bounce your zabbix agent, add the “custom.megaraid.degradedCount” as an item to your server(s) in the web interface and you are set.
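
One wrinkle with the one-liner: if MegaCli is missing or prints nothing, awk prints an empty line instead of a number. A small wrapper can smooth that over. This is a sketch, not what I actually run: the function name megaraid_degraded and the MEGACLI override variable are made up for illustration. It fails loudly when the binary is not there and always prints an integer otherwise.

```shell
# Hypothetical wrapper around the pipeline above. MEGACLI is an assumed
# override variable; by default the path from the install steps is used.
megaraid_degraded() {
    megacli=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli64}
    if [ ! -x "$megacli" ]; then
        echo "MegaCli not found at $megacli" >&2
        return 1
    fi
    # "total + 0" forces a numeric 0 when no Degraded line was seen.
    "$megacli" -AdpAllInfo -aAll -NoLog |
        grep -A 2 'Virtual Drives' |
        awk '/Degraded/ { total += $3 } END { print total + 0 }'
}
```

Saved as a script, the UserParameter line could then point at that script instead of carrying the whole pipeline.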

This works really well. I tested it out by yanking disks out of running servers and watching their degraded count jump up. You may want to do the same just to ensure it is working end-to-end but don’t say I told you to do it and don’t yank a disk out of an array that can’t handle multiple disk failures. If you do and another disk actually fails while your test disk rebuilds you will be in rough shape.

3 Responses
  1. Mark Luffel
    March 12, 2009

    “yanking disks out of running servers”!!!
    Now that’s malicious testing!
    I’m imagining a corporate team building exercise for datacenter workers, like the catch-me-I’m-going-to-fall-backwards gag.

    Have you tried the next logical step: yanking the power to your live database to test the hot-failover? ;)

  2. March 13, 2009

    Ha, have not tried that test. The database server hasn’t ever let us down so I am just letting that guy run forever. I have tested the warm standby a couple times though just to make sure we actually had one that worked.
