Working at a startup certainly provides fun stories to tell. We had an amusing interaction today. As back story, we are in “Release Mode” at work and thus things run more strangely than normal.
A Comcast technician finally came out to investigate why our 1-week-old cable internet was so flaky. On a side note he concluded it was because of “all the statics” I had configured in the router. I have absolutely no idea what he was talking about (I had no custom settings other than a DMZ host) but when he left our speed was back up so whatever. He swapped a chunk of coax and reset the router to factory settings and I am inclined to credit those actions for the fix instead.
In any case, when he arrived at our office (with 2 technicians in training behind him) they saw the following at 2pm:
- Lights off except for display glow and some sunlight along one wall
- Hudson dots projected across a wall
- Piles of pizza boxes, cookies, and chip bags
- Beer and red bull being consumed by everybody (have to modulate the effects you see)
- Piles of wire, our custom ducts into the ceiling grid, etc.
They just stood there completely frozen for several seconds until I stood up, turned the lights on, and greeted them. I have to imagine that isn’t typically the environment they walk into.
I think our office has a cool, bona fide startup vibe and we get a lot of host incubator tour stops at our door because it generally looks like stuff is getting done.
I realized the other day how dependent I am on muscle memory for remembering passwords. It is possible spatial memory is a better concept to mention.
I suspect I know 50 or so passwords of the truly random variety (I usually shoot for 8 – 10 characters across 3 classes). But, if you asked me for one of those passwords over the phone or in person I would be completely unable to remember beyond the first 2-3 characters without a keyboard to type on. With a full size keyboard in front of me I can effortlessly type out all of those passwords I have stored away. Sometimes if I concentrate and fake-type on a table or wall I can remember but not always especially when a large number of non-alphanumeric characters are involved.
Similarly with my iPhone SSH client (I use pTerm and it works pretty well) I have to set up keys, otherwise I am unable to connect to anything as I cannot remember passwords using the 2-finger/thumb keyboards found on phones.
Pretty random post I know but I was thinking about this lately. If other people work this way I would appreciate finding out; otherwise it seems there is something strange happening on my end.
As part of the recent Hudson setup I needed to run MATLAB scripts remotely on Windows machines. Despite a fair amount of searching I could not find instructions anywhere for making this work. There were, however, a lot of unanswered posts asking how to do it.
The main challenge is that MATLAB on Windows doesn’t put its output in the console. It instead opens another window (even if you use -nodesktop or any other param combo). That combined with the barbaric licensing they employ is the challenge.
Here is how I got it to work. This approach works great for running scripts remotely. It would be annoying if you wanted to do prolonged interactive work with MATLAB given the requirement of using “-logfile”. I had to get trickier to get MATLAB scripts hitting the GPU remotely but I am leaving those steps out of this post. If someone does need to accomplish that let me know and I can continue in the comments.
Why can this be helpful? I can think of a few reasons:
- You have 1 machine licensed to run MATLAB that is shared by several people that only need occasional access. Linux would work better, but perhaps you are stuck with Windows by preference or circumstance.
- You need to run bulk jobs, perhaps by cron or similar, on Windows machines automatically.
- You want to write larger scripts of which M code is only a component while sticking with bash.
- You want to do some sort of automated testing on Windows of MATLAB code.
I have tested this approach successfully in the 32bit and 64bit versions of Windows XP and Windows 7. Perform these steps as the user on the machine that is licensed to run MATLAB.
- Go to the Cygwin website, download the setup.exe file, and run that on your Windows machine.
- In addition to the defaults install the openssh and cygrunsrv packages.
- In a cygwin prompt, run ssh-host-config -y. All prompts should get answered automatically and you will get a “Host configuration finished” message.
- In a cygwin prompt, run cygrunsrv -S sshd.
- Now open Control Panel -> Admin Tools -> Services.
- Right click “CYGWIN sshd” and hit properties.
- On the Log On tab, change the radio button to “This account” and specify the user name and password for the licensed MATLAB user on the Windows machine.
- Hit Apply and OK and close the services dialog.
- Open Control Panel -> Admin Tools -> Local Security Settings -> Local Policies
- Make sure the licensed MATLAB user on the Windows machine is explicitly listed for each of the following permissions:
  - Adjust the memory quotas for a process.
  - Create a token object.
  - Log on as a service.
  - Replace a process level token.
- In a cygwin prompt, run chown [licensed user] /var/log/sshd.log and chown -R [licensed user] /var/empty.
- Now open the services dialog back up and restart CYGWIN sshd. Make sure it starts back up cleanly.
That should wrap up the setup.
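As a rough sketch, the cygwin-prompt parts of the steps above could be collected into a small script. The run helper, the function names, and the DRY_RUN switch are made up for illustration; only the ssh-host-config, cygrunsrv, and chown commands come from the steps themselves. The Services and Local Security Settings changes still have to be done by hand between the two functions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps before switching the service account in the Services dialog.
install_sshd_service() {
    run ssh-host-config -y
    run cygrunsrv -S sshd
}

# Steps after the account and policy changes.
# $1 is the licensed MATLAB user on the Windows machine.
hand_over_to_user() {
    run chown "$1" /var/log/sshd.log
    run chown -R "$1" /var/empty
}
```

With the default dry-run setting, calling hand_over_to_user with a user name only prints the two chown commands, which makes it easy to check them before running anything for real with DRY_RUN=0.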
The general idea is that we use sshd to connect to the Windows machine and run our MATLAB scripts. The key piece is changing the user that runs the sshd process to be the same user that is licensed to run MATLAB.
Now you can connect to this machine and run MATLAB scripts like this:
matlab -nodesktop -nosplash -r "rand,quit" -logfile matlab_output.txt
That would run rand and put the output in matlab_output.txt. You can similarly (or more practically) run entire scripts:
matlab -nodesktop -nosplash -r "my_file,quit" -logfile script_output.txt
You can chain as many commands/scripts as you would like as part of the “-r” parameter by separating them with commas or semicolons. Note that there will be a delay between you issuing the command and when the specified log file gets written to. This is because in Windows a launcher/starter process runs, spawns the actual MATLAB process, and then exits while the MATLAB process continues to crank. You can use “-wait” to eliminate this and have the launcher process remain running until MATLAB terminates. This is particularly helpful when writing scripts that should not continue until your MATLAB call has completed.
With respect to Hudson this setup is really nice. You can just write bash scripts and use ssh to connect to the Windows slaves. The only change is the need to cat the output files if you want them to get logged and/or archived in Hudson.
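A minimal sketch of what such a build step might look like, assuming key-based ssh access to the Windows slave is already in place. The function names and the builder@win-slave host are hypothetical; the MATLAB flags are the ones described above, with “-wait” included so the step does not move on before MATLAB finishes.

```bash
#!/usr/bin/env bash
# Build the MATLAB command line.
# $1: comma-separated MATLAB commands or script names, $2: log file name.
matlab_command() {
    printf 'matlab -nodesktop -nosplash -wait -r "%s,quit" -logfile %s' "$1" "$2"
}

# Run MATLAB on a Windows slave over ssh, then cat the log so that
# the output ends up in the Hudson console log.
# $1: ssh target, $2: MATLAB commands, $3: log file name.
run_remote_matlab() {
    ssh "$1" "$(matlab_command "$2" "$3") ; cat $3"
}

# Example (hypothetical host and script name):
# run_remote_matlab builder@win-slave my_file script_output.txt
```

The cat runs even if MATLAB exits with an error, so a failed run still leaves its log in the job output.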
We are wrapping up a fun Hudson setup at work and I wanted to share our experience at a high level. Hudson is an “Extensible continuous integration server” that is used by a huge variety of companies for projects of all types. Hudson is definitely geared for Java projects out of the box but is very flexible and can be a huge help even on very non-Java code. I have configured a few different environments like this and like Hudson the most for a few reasons:
- Overall it is very polished and bug free. I suspect it being built by Sun for use on huge projects helps it here.
- The user base is sufficiently large so it is easy to find help and there are literally hundreds of plugins covering all sorts of interactions with different languages, version control systems, ticketing systems, etc. It is also possible to write your own plugins.
- It has virtually no dependencies outside of Java. Even for persistence it needs nothing as it uses intuitive flat file structure for storing everything. This also makes it very easy to migrate to a new server.
- The configuration of slave nodes and jobs is as straightforward as it gets.
- It lets you have as much control as you want; dropping to shell scripts is no problem at all.
In general continuous builds can make a huge difference for a development team. Everyone can move more quickly and they are completely worth the time and effort to set up right.
We are working with a large C project that must be built and tested on many platforms. Building alone must be done on Windows 32, Windows 64, Linux 32, Linux 64, and OSX 32. Testing must be done on at least one of each architecture with additional rounds to cover different OS and GPU combinations. This variety of platforms takes a lot of time to test and it simply does not make sense to try and stay on top of the combinations with raw manpower.
The unique part of our testing is that it must be done on a machine with an Nvidia GPU. You will see below that we are using Citrix XenServer Virtual Machines for building but could not do the same for testing since VMs do not have native access to the GPU (with one exception that I know of). More notes about that later.
We build with standard Make and have to use a variety of compilers (gcc, g++, nvcc) to get everything generated.
The Build Machine Hardware
My goal with the build machine was to keep it cheap (under $1500) and to have at least 8 cores for doing builds. Our project can take a while to build due to its size and number of dependencies and the best way to speed it up is raw CPU with usage of the “-j” argument in Make.
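For illustration, here is one way a build script could match “-j” to the number of cores instead of hard-coding it. The function names are made up, and the fallback to getconf is an assumption for systems without nproc.

```bash
#!/usr/bin/env bash
# Print the number of online CPU cores.
job_count() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN
    fi
}

# Run make with one job per core, passing through any extra arguments.
parallel_make() {
    make -j"$(job_count)" "$@"
}

# Example: parallel_make all
```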
That said, these are the parts I went with. I didn’t shop around and used Newegg for everything for the tracking and RMA convenience. The only exception was a dual molex to 8 pin power adapter that I picked up at the nearby Microcenter for about $15.
|Part|Price|
|---|---|
|NORCO RPC-450 Black 4U Rackmount Case|$69.99|
|TYAN S7002G2NR-LE Dual LGA 1366|$254.99|
|2 x Xeon E5506|$473.98|
|Diablotek PHD 750W|$79.99|
|WINTEC 6GB (3 x 2GB) DDR3 1333|$229.99|
|6 x Seagate Barracuda 7200.12 ST3160318AS 160GB 7200|$227.94|
|LITE-ON Black 18x DVD-ROM|$19.99|
|Shipping for all the above|$51.86|
Some notes about these parts:
- If I had upgraded anything it would have been bumping the Xeons up to something with hyperthreading, but that would have broken my spending limit.
- The hard drives are set up as 1 for the XenServer host, a 4 disk raid 10 (using mdadm on the XenServer host), and a spare drive for the raid 10. The board only has 6 SATA ports so the 6th went to the external DVD-ROM.
- Tyan sells some awesome 2 socket boards if you can spend a little more money including ones with LSI raid controllers, more SATA ports, and up to 4 x16 PCIe 2.0 slots for running loads of GPUs.
- When buying bigger, multi-socket boards like this make sure your power supply has all the connectors needed or that you can buy adapters to compensate. The board above needs a 24pin and 2 x 8pin plugs (1 per CPU) from the power supply.
- The first TYAN board gave a code FF on power up and I had to RMA it. The second board worked without any trouble. When buying parts like this plan on having to RMA something. If you need a machine fast it is probably best to stick with a vendor like Dell.
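As a hedged illustration of the raid layout mentioned in the notes above, creating a 4 disk raid 10 with one spare through mdadm could look roughly like this. The device names (/dev/md0, /dev/sdb through /dev/sdf) and the run helper are assumptions, not the actual devices on this machine, and the sketch only prints the command unless DRY_RUN=0 is set.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a raid 10 array from four member disks plus one hot spare.
# $1: md device, $2-$5: array members, $6: spare.
create_raid10() {
    run mdadm --create "$1" --level=10 --raid-devices=4 --spare-devices=1 \
        "$2" "$3" "$4" "$5" "$6"
}

# Example (device names are placeholders):
# create_raid10 /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```

After a real run, /proc/mdstat shows the state of the array and its initial sync.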
On the hardware mentioned above we are running 4 virtual machines using XenServer. I settled on XenServer because it is free and full featured. My first attempt was VMWare Server running on Fedora but it is full of bugs and limitations (only saw one of the Xeons, and only allows 2 cores per VM). Their bare metal hypervisor (called ESXi) is supposedly better but it has strict hardware requirements and I wasn’t feeling very confident about VMWare after trying the VMWare Server product.
XenServer is a bare metal hypervisor, a very minimal Linux distribution. It does have a very complete command line interface for interacting with VMs and it does at least ship with mdadm so you can setup software raid arrays to run VMs on. The more feature rich, GUI-based administration app for working with the VMs only runs on Windows unfortunately and it connects to a running XenServer hypervisor. Most things can be done through the Windows application and anything more sophisticated can likely be done through the command line with shell scripts. This Guide has straight forward instructions for getting things running.
Once I settled on XenServer this part was very smooth. I made sure to fire up sshd and VNC Server on all virtual machines so I rarely have to use that Windows-only management application.
The 4 Virtual Machines are handling our builds for Windows and Linux, 32bit and 64bit for each. Each VM has access to all 8 Xeon cores and I staggered the polling frequency for each build in Hudson to avoid fights over CPU resources.
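Hudson's SCM polling schedule is written in cron syntax, so one way to stagger jobs is to give each one the same interval with a different minute offset. The sketch below is only an assumption about how such a stagger could be generated, not the schedule we actually use, and the function name is made up.

```bash
#!/usr/bin/env bash
# Print one cron-style schedule per job, spreading the jobs evenly
# across the polling interval.
# $1: polling interval in minutes, $2: number of jobs.
staggered_cron() {
    local interval="$1" jobs="$2" i=0 offset minute spec
    while [ "$i" -lt "$jobs" ]; do
        offset=$(( i * interval / jobs ))
        spec=""
        minute=$offset
        while [ "$minute" -lt 60 ]; do
            spec="${spec:+$spec,}$minute"
            minute=$(( minute + interval ))
        done
        echo "$spec * * * *"
        i=$(( i + 1 ))
    done
}

# Example: four build jobs polling every 20 minutes.
# staggered_cron 20 4
```

Each printed line can be pasted into the polling schedule of one build job.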
We are not able to use these for testing as we need native access to the GPU. The Parallels Extreme Workstation is another hypervisor that does support native access to certain cards but it is expensive and has very specific hardware requirements. I assumed my pieced together Newegg server would not be a good fit or at the very least would ensure I couldn’t get decent support from Parallels.
Hudson is definitely optimized for Java projects but we are having great success using it on our C project. Without going into too much detail, here are the general pieces of our setup. If anybody else is setting up something similar I am happy to answer any questions.
- Using the shell script option for all build steps.
- Using SSH for ALL slave nodes including Windows. For Windows we are using cygwin to install and run sshd. This has many advantages including being able to pretend you aren’t having to use Windows. More practically this lets you write your Windows scripts in good old bash using the standard Linux tools instead of having to suffer great pain with .bat files. The great part is this means you can often use common scripts regardless of slave OS.
- Using a separate job for each build and test environment.
- Build jobs are architecture specific (e.g. Linux 32bit)
- Test jobs are OS and device specific (e.g. Fedora 10 32bit running card X) and dependent on successful build jobs of compatible architecture. As an example, when the Linux 32bit build completes separate test jobs start up for Fedora 10 32bit card X, CentOS 32bit card Y, etc.
- We have a custom test harness that consists of shell scripts which compare GPU results against CPU results in backgrounded processes. When our test jobs complete a separate script parses this output and generates JUnit-compatible XML reports which Hudson reads. Hudson thinks they are JUnit and provides great trends, graphs, and data about these tests. We background the testing processes so that if they seg fault or time out we can kill them in the main script that is running all of the tests without having to stop testing altogether for that build.
- We have all of our various scripts stored on the master and the first step of the slaves is to scp over the latest versions of these scripts. This makes things much easier to maintain and ensures you only need to add/update scripts in one place.
- Using the Log Parser Plugin to fail builds. This is a great plugin that makes it trivially easy to specify what counts as a failure in your build output and more generally is great for grouping your output into different categories.
- We tarball the compiled code and include that as an artifact on the last 10 builds of each architecture so that a clean build is always only a click away.
- Using ViewVC and the corresponding plugin to link all change logs in Hudson to the specific diffs. Hudson has plugins for more sophisticated repo browsers like Fisheye but ViewVC is free and functional enough.
- Our Mac builds and testing are handled by the developer laptops in the office. Our laptops are all listed as slaves and Hudson will snag one for usage when it sees one on the LAN. Eventually we will probably grab a Mac Mini to handle this.
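To make the test harness idea from the list above more concrete, here is a stripped-down sketch, not our actual harness. It runs each test in the background with a watchdog that kills it after a time limit, records exit codes, and emits a JUnit-style XML report for Hudson to read. All function names, the suite name, and the report layout are assumptions, and test names are assumed to contain no spaces or XML special characters.

```bash
#!/usr/bin/env bash
# run_with_timeout SECONDS COMMAND [ARGS...]
# Runs the command in the background and returns its exit status.
# A test killed by the watchdog shows up as 137 (128 + SIGKILL).
run_with_timeout() {
    local limit="$1" pid watchdog status=0
    shift
    "$@" &
    pid=$!
    # Watchdog: kill the test if it is still running after the limit.
    ( sleep "$limit"; kill -9 "$pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog=$!
    wait "$pid" || status=$?
    kill "$watchdog" 2>/dev/null || true
    wait "$watchdog" 2>/dev/null || true
    return "$status"
}

# run_suite SECONDS TEST... : prints one "name exit_status" line per test.
# Test output is discarded here; a real harness would keep it.
run_suite() {
    local limit="$1" test_script status
    shift
    for test_script in "$@"; do
        status=0
        run_with_timeout "$limit" "$test_script" >/dev/null 2>&1 || status=$?
        echo "$(basename "$test_script") $status"
    done
}

# junit_report SUITE : reads "name exit_status" lines on stdin and
# writes a JUnit-style XML report on stdout.
junit_report() {
    awk -v suite="$1" '
        NF == 2 { names[++n] = $1; codes[n] = $2; if ($2 != 0) failed++ }
        END {
            printf "<testsuite name=\"%s\" tests=\"%d\" failures=\"%d\">\n", suite, n, failed
            for (i = 1; i <= n; i++) {
                if (codes[i] == 0) {
                    printf "  <testcase classname=\"%s\" name=\"%s\"/>\n", suite, names[i]
                } else {
                    printf "  <testcase classname=\"%s\" name=\"%s\">\n", suite, names[i]
                    printf "    <failure message=\"exit status %s\"/>\n", codes[i]
                    printf "  </testcase>\n"
                }
            }
            print "</testsuite>"
        }'
}

# Example (paths are placeholders):
# run_suite 300 tests/*.sh | junit_report gpu-vs-cpu > results.xml
```

Because each test runs as its own background process, a seg fault or a hang only costs that one test and the loop carries on with the next one.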
Hudson is a great tool. The above setup was a fair amount of work to set up but will be a big help to our development team. We have the comfort of knowing builds and testing are happening constantly and have easy access to change logs, build histories, testing trends, stable build tarballs, build timings per OS, and all sorts of useful information and validation. We are tweaking it and making it better every time someone using it thinks of a change or some new information that would be helpful. I am very happy with the end result and feel that Hudson is flexible enough that we won’t ever outgrow what it can help with.
In the next few days the first book completely about Zabbix (that I am aware of) is being released. Entitled Zabbix 1.8 Network Monitoring it is a welcome release. In my mind books and ever increasing version numbers are a good sign for the health of Zabbix.
While the existing documentation for Zabbix is pretty decent it is presented in a very reference-focused manner. It can explain something you know to ask about very well but doesn’t offer tips or more general context for the best approach to take in your setup. This new book takes that core documentation (restates it perhaps a tiny amount too much) and makes it far more presentable and easier to scan in a way that promotes picking up those little details the documentation is not good for.
I’ve been working through most of this book the last few days and wanted to provide some thoughts both good and bad.
- This is a very complete book. The gaps in the documentation are well filled.
- I’ve set up 1.4 and 1.6 Zabbix servers and installed a 1.8 server to work through the book with. Even when reading it very quickly I learned several new things that I flat out wasn’t aware of or saw cleaner ways to do things (like using the built in IPMI agent support instead of wrapping freeipmi or similar).
- Tasks around maintaining Zabbix are well covered. I greatly appreciate there being dedicated sections and chapters on upgrading, backups, reporting, and performance tuning of Zabbix itself.
- While the book does start out slow with basic content it finishes very strong and covers a lot of more advanced topics that make it obvious the author has actually used the software he is writing about.
- The screenshots, code samples, and command line examples are complete enough that you don’t have to have a terminal open while reading.
- I like the iterative approach used in many of the examples where the reader is shown each layer of the setup to aid with troubleshooting.
- The versions of Zabbix are not highly dissimilar so this would be a good reference for any reasonably recent release (with primarily 1.6 experience the entire book still felt completely relevant).
- The book is pretty verbose and a nontrivial portion is already captured by the freely available documentation.
- The Linux help early on feels out of place and is not particularly good advice. Perhaps it is just me but surely it would be safe to assume intermediate Linux knowledge from an audience that is setting up network monitoring. The parts towards the front about writing init scripts and getting things installed are completely unnecessary on any reasonable Linux distribution (yum, pacman, apt-get all have Zabbix server and clients you can install with a single command). Though, I suppose that information could be helpful if you want a non standard config or some insight into how things work a bit under the hood.
- My biggest issue is the amount of background and the number of pages before what I consider the meat gets covered in depth. That is, triggers, actions, notifications, and charting.
Overall I am happy to see this book released. Despite some rough edges in the interface and documentation (this book hopefully will fill the latter gap) Zabbix is a really powerful, flexible piece of software that deserves more users. I have used Nagios, Zenoss, Cacti, and plenty of my own bash script setups and prefer Zabbix over everything else.
While this book does restate some of the official documentation it does bring deeper examples and better context in addition to a lot of additional content that would be very helpful for someone getting started with Zabbix. It is a long book that probably doesn’t make sense to read cover to cover (you probably will never need all the information in here unless you are building a gnarly monitoring install) but it would be a great learning tool and handy to have on hand as a reference long term for when setting up more advanced configurations.
There is enough good content in here to strongly recommend it to anyone learning about Zabbix especially if you are tasked with setting up a large or complicated installation. If you aren’t sure, give this Sample Chapter a read and see how you like the feel of the book first. That sample chapter is actually a pretty great resource as it covers the basic monitoring, triggers, actions, and notifications that are the core of Zabbix.
If you are looking to monitor your gear with any degree of depth Zabbix really is a great tool. The web-based monitoring options popping up are great for shallow uptime reports and basic notifications on complete outages but Zabbix can be used to tell you a lot more about why things are failing. At my old company I was monitoring DB TPS, Java heap and garbage collection stats, OpenMQ queue sizes, Postfix queues, end-to-end response time for users logging into their dashboards, and all sorts of other cool stuff in addition to the basics you get with the default templates.
Disclaimer: I was provided a copy of this book by Packt Publishing and was happy to give it a read.
It has been a wonderfully busy 2010 thus far. The blog has suffered but hoping to have a post up soon about building a monstrous build server on a budget.
That said, I have been forced (by software, not by management) to use MySQL on a couple projects in the last month and after being a PostgreSQL user for the last several years it has been an incredibly frustrating experience worth throwing up a couple bullet points about. The more I use MySQL the more frustrated I get. These are high level and not well argued but hoping to get points across. I always invite digging into details.
Reasons why I dislike MySQL:
- The planner is incredibly dumb. I feel like it does the wrong thing most of the time.
- Temporary tables and subselects are relatively worthless. They crush performance and are full of bizarre gotchas due to limitations and bugs in the planner (like not being able to use a temp table more than once in the same query).
- The tuning process is not at all intuitive or consistent. Tuning MySQL queries feels like trying to maximally combine hacks and workarounds for bugs in the planner to achieve something vaguely close to desired speed.
- The planner goes to disk a LOT. Subtle adjustments to the query will prevent it from doing so but why can’t it do the right thing and avoid disk except as a last resort or only when something doesn’t fit in memory? My laptop has 4GB and even with a 100MB DB the MySQL planner goes to disk all the time.
- The documentation is superficial, incomplete, and inconsistent. The examples are trivial and unhelpful.
- Doing a DB dump locks the ENTIRE DB. This is awful. I don’t want to set up replication on a trivial DB (less than say 100MB) just to do backups without locking things up.
This experience has really emphasized some huge advantages of PostgreSQL even ignoring the technical points:
- The code is of exceptionally high quality and exceedingly clean. Corner cases are rare and bugs even rarer.
- The documentation is complete, organized, and consistent. You can genuinely learn 95% of what there is to know about PostgreSQL by reading the excellent documentation (that is kept updated and synchronized with each new release). It is also very easy to find. It feels like they spend as much time and effort on the documentation as they do the code. Only OpenMQ documentation has rivaled PostgreSQL in completeness in recent memory.
- The PostgreSQL planner is very solid and it makes the right decisions most of the time without any assistance from a human. You can use the documentation to build a solid foundation of understanding about how things work and then use that plus your own intuition to achieve desired results. If a feature is available in PostgreSQL it will be fast and fully understood by the planner. The same cannot be said for MySQL where raw cycles are required to slowly absorb all the known bugs, workarounds, and feature-specific knowledge about what is actually usable on a non trivial DB and what is not.
Just some thoughts. Perhaps my perception is incorrect given the disparity in usage time. To be blunt I would smile if Oracle destroyed MySQL and on a semi-related note believe the shenanigans of the founder/creator of MySQL around the Oracle acquisition and trying to get back something that was fairly purchased are completely lame.
It has been quite a year – a good year considering the broad financial turmoil of 2009. Many goals have been accomplished but perhaps the greatest accomplishments for me have been professionally at WTC – a company I have had an active hand in since its earliest days.
Over the last year the number of customers, the amount of revenue, and the degree of stability (both financially and technically) have increased dramatically. It is a completely different company than 12 months ago with serious traction. My role in that progression was in my opinion minor and I chalk the success up to the absolutely stellar team that makes the company run now. I could not be more satisfied or proud of what has been accomplished and could not be happier or more complimentary about the team that has gathered around it.
Those facts make it difficult and potentially confusing to mention my decision to step away from an active role in the company. I will still be involved as an adviser, supporter, and proud evangelist but will no longer be actively contributing to the technical development or operations of the company. This change is being made slowly over the next month or two to ensure the transition is smooth and I genuinely believe the company is going to be a great example of success for Atlanta.
What is next for me?
I am joining another startup in Atlanta working on an utterly different product. The founders consist of guys I worked with for years in the undergraduate computer labs at Georgia Tech and the technology is around areas I am very passionate about – those that I was focused on as a student before jumping at a chance to help take WTC off the ground as an early engineer. This opportunity combined with the strides made in the last year to put WTC in a stable, growing position are the reasons for this decision.
Needless to say 2010 should be interesting and I am looking forward to it. I’ll be trying to keep this blog a bit more active as well. I had a few posts in 2009 that really took off and hope to give it more attention in 2010.
There are an awful lot of web design/development firms or agencies out there and a disheartening percentage are just terrible at what they do. It is common to find firms that fail at both the design AND development though often more common to find a place with decent designs that is utterly incapable of writing code, while claiming they can code just fine. They just can’t implement their designs unless the implementation is done via huge images or flash.
Here are a few questions for vetting firms with the goal of purging the truly unqualified ones. These questions are based on actual experiences where the lack of understanding was so complete that I felt embarrassed for them. They can come in handy when looking for a firm to assist with some work.
Filtering the Worst
In what country is your coding done?
Generally I want to hear USA (point being – same country I am in). There are exceptions and I don’t want to get into my broader feelings on offshoring but that is what I need to hear.
What is DNS? Have you ever updated a DNS record?
Don’t care at all about getting a technically correct explanation here. I just want to get a feeling that they understand domains don’t magically point at the right place and that they could handle making simple adjustments when I don’t control the domain.
Using only notepad, textedit, or similar, write a page that has the following:
- a green div
- a button underneath that div
You would likely be amazed at how many agencies would be stumped by this.
Do you know the difference between GET and POST? When would you generally use each of those?
Want to find general understanding here. Not concerned about a complete, technically correct answer but just want to hear vague familiarity.
Do you know what SSH, SCP, or SFTP are?
Them being familiar at all with just one of these is sufficient. A similar question is how would you get files on a server without FTP?
What are steps you could take to speed up the load time of a page?
What is the difference between server side and client side code? What would you personally prefer for each?
Basic blocks here but there are firms with paying customers that could not answer this question well.
Follow Up Questions
Only if the above completely trivial questions can be answered with some confidence should you consider asking follow up questions. Some ideas:
- How do you structure your code/css?
- What do you use for version control?
- Let me see some stuff you have worked on.
- What libraries do you commonly use?
I am not kidding when I say that I know of professional agencies that cannot answer a single question from that first set satisfactorily. Don’t assume the basics. You have to start from nothing when considering a web dev firm or else you might find yourself in a painful situation down the road. There are some excellent firms out there that can really get good work done so don’t take this as a blanket statement covering them all. It is just fair warning that you have to be careful when looking for someone to work on your stuff.
We currently use Arch Linux exclusively for servers. Much of our equipment comes from Dell and one of the gotchas of using a non-Redhat, non-SUSE linux distribution with their servers is you cannot just drop in their Open Manage tools to monitor everything. As a side note, despite the bloat of Open Manage, it actually isn’t a bad set of tools once you get it installed (on an rpm-based distribution) – the command line utilities you get with it are pretty decent. The GUI stuff is largely worthless in my biased opinion.
In any case, Open Manage is a big pile of rpms with lots of dependencies so it wouldn’t be easy to transfer to Arch. I posted awhile back about using LSI’s Megaraid CLI tool to monitor Dell raid arrays but what about everything else? One big item that was really haunting me was power supplies. I had no monitoring on those things so if I went too long between datacenter visits to check for amber lights on the fronts of the servers, we could have a double PSU failure on a really important server and be in a lot of trouble. Server PSUs fail A LOT so don’t discount the importance of monitoring them. My personal experience might be unusual, but I’ve had more PSUs fail than hard drives.
Yesterday I read up on Intelligent Platform Management Interface (IPMI) and saw that Dell has supported it for awhile. I suspect their Open Manage tools are simply a proprietary wrapper around this.
So you just need some other software that will let you access this stuff. Some quick searches revealed ipmitool and freeipmi being popular. I went with freeipmi because it is a very active project with recent updates (the Arch Linux AUR had them both but both PKGBUILDs were broken).
Installing it is straightforward: download tarball, unpack, ./configure, make, make install. This will place a bunch of command line tools in /usr/local/sbin/. You’ll probably want to script the install process if you have to set up lots of servers. Or, don’t be lazy like me and bundle up a PKGBUILD for the Arch AUR and just use yaourt for the installs.
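If you do go the scripting route, a minimal sketch might look like the following. The function names are made up, the tarball path is whatever you downloaded, and the script assumes the tarball unpacks into a single top-level directory (the usual name-version/ layout).

```bash
#!/usr/bin/env bash
# unpack_source TARBALL WORKDIR
# Extracts the tarball into WORKDIR and prints the source directory.
unpack_source() {
    local tarball="$1" workdir="$2" top
    # Assumes the first archive entry names the single top-level directory.
    top=$(tar -tzf "$tarball" | head -n 1 | cut -d/ -f1) || return 1
    tar -xzf "$tarball" -C "$workdir" || return 1
    echo "$workdir/$top"
}

# build_and_install SRCDIR
# The configure, make, make install steps, run in a subshell so the
# caller's working directory is left alone.
build_and_install() {
    ( cd "$1" && ./configure && make && make install )
}

# install_tarball TARBALL : unpack into a temp directory, then build.
install_tarball() {
    local workdir src
    workdir=$(mktemp -d) || return 1
    src=$(unpack_source "$1" "$workdir") && build_and_install "$src"
}

# Example (file name is a placeholder):
# install_tarball freeipmi-x.y.z.tar.gz
```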
You can check out the README and man pages for information about all the various commands but I am just using one: ipmi-sel. This prints out the contents of the “System Event Log” and seems to correspond very nicely with any messages that appear on the front LCD of the server. I removed, replaced, and removed again a PSU in a server and saw this perfectly parseable and useful output:
28:21-Aug-2009 09:00:36:Power Supply Status:Presence detected
29:21-Aug-2009 09:00:37:Power Supply PS Redundancy:Redundancy Lost
30:21-Aug-2009 09:04:56:Power Supply Status:Presence detected
31:21-Aug-2009 09:04:57:Power Supply PS Redundancy:Fully Redundant (formerly "Redundancy Regained")
32:21-Aug-2009 09:05:16:Power Supply Status:Presence detected
33:21-Aug-2009 09:05:17:Power Supply PS Redundancy:Redundancy Lost
Other events I have seen in this log are memory DIMM failures and the case being open – and those are conveniently the only other alerts I’ve seen on the LCDs before. You can clear out the SEL and have a few other options just with that one ipmi-sel tool – read the man page for more information.
For monitoring PSU failure I am using this script. I am sure there are tighter approaches but this one works fine:
/usr/local/sbin/ipmi-sel | grep "PS Redundancy:" | tail -n1 | grep "Redundancy Lost" | wc -l
That will return a 1 if the last “PS Redundancy” related item in the log is a failure, and 0 otherwise. You can then easily snap that into Zabbix or whatever monitoring software you prefer. I did a post with more detail on adding items to Zabbix awhile back that might be helpful if you are not familiar with the process.
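The same check can be wrapped in a function that reads the log from stdin, which makes it easy to try against a saved copy of the SEL before wiring it into monitoring. This is just a sketch of the one-liner above; the function name is made up, and grep -c stands in for the final grep plus wc -l.

```bash
#!/usr/bin/env bash
# Reads ipmi-sel output on stdin. Prints 1 if the most recent
# "PS Redundancy" event is "Redundancy Lost", otherwise 0.
psu_redundancy_lost() {
    grep "PS Redundancy:" | tail -n 1 | grep -c "Redundancy Lost" || true
}

# Example:
# /usr/local/sbin/ipmi-sel | psu_redundancy_lost
```

An empty log, or one with no “PS Redundancy” entries at all, also gives 0.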