Cheap Home Network Storage

2009 July 3
by Joe

Ever since I installed an SSD in my Macbook Pro I have been finding ways to move more and more stuff off of my laptop hard drive. I wanted to move a bunch of rarely needed stuff to a network share at my house including most of my music library. Ideally I also wanted to be able to play the network stored music from my laptop when on my home wifi.

There are a lot of network attached storage options out there now but they are a bit expensive and have arbitrary limitation mixtures in the software. Plus Lindsey was not keen on me spending more money on gadgets.

Instead I looked at the random gear I had around the house and was able to quickly get raided network storage setup for no cost that is more flexible than most of what you can buy. Nothing new here, this is a pretty familiar setup. It just worked out well for me so wanted to share.

I had the following to work with:

  • An old Dell laptop purchased about 4-5 years ago. 2GB of ram, I don’t know any of the other specs.
  • The cheap WRT54GS Linksys router I was already using at the house.
  • A couple 500 GB external hard drives.

That isn’t much but I imagine many people have a similar combination just lying around. Most of my extra gear (including the last desktop I owned) got absorbed when the company was starting up.

Here’s the steps to set this up for yourself.

[1] Put DD-WRT on the router

The big items I wanted were reserved DHCP (so I could reserve an IP for the new “server”) and VPN (so I could connect remotely to my home network). DD-WRT offers both of these along with loads of additional features. It makes the stock Linksys firmware look like garbage in comparison.

My WRT54GS is an especially cheap router with a tiny amount of memory so I had to install the pared down “micro” version of DD-WRT but it still works great.

For help and information on DD-WRT go visit their site. You can enter your specific router model number and get the exact files you need along with specific instructions. Its really easy to do.

[2] Put Arch Linux on the laptop

Arch Linux is a preference thing but I do believe it is the best Linux distribution. I use it on all of our servers at work now and am glad Sean told me about it. Since in this setup I am creating a dedicated server out of the old laptop and don’t plan to ever use it directly Arch is a perfect fit - just do a base install. No need for x windows or any of that cruft. Arch makes installing things so incredibly easy. In 95% of the cases the steps are to grab the software with pacman, add it to the DAEMONS array in rc.conf (if its something that should start on its own on reboot), and then start it.

Obviously you can use a different Linux distribution if you would prefer but subsequent content below assumes Arch.

To get Arch or to view their fantastic wiki and documentation go to archlinux.org.

At this point I gave the Arch server’s mac address an IP reservation in DD-WRT. That’s optional but I would highly recommend either doing that or using a static IP.

[3] Connect external drives to laptop and raid them

SSH into the Arch server and plug the drives in. The faster the hard drive connection the better but honestly for stuff you are going over a network for it doesn’t really matter. Your laptop hard drive is probably going to be /dev/sda and your plugged in external drives are probably going to be /dev/sdb and /dev/sdc. I am making that assumption in the following, you can ls -l /dev/sd* to verify what is showing up. On my machine the external drives showed up as “storage” (for the group permission) while the laptop drive was “disk”.

Install mdadm with pacman -Sy mdadm. This is for the software raid. Also install parted with pacman -Sy parted. This is what I use for partitioning instead of fdisk. For more information about mdadm read the wikipedia page which has good tutorials and more information in the external links section.

Now, assuming /dev/sdb and /dev/sdc for the external drives as mentioned above do the following. This is all done as root.

  • Partition the first drive. Type parted /dev/sdb and you will get the parted prompt.
    • From the parted prompt type mklabel gpt.
    • Type mkpart primary 0 100%.
    • Type quit to leave the parted prompt.
  • Now partition the second external drive in the same way (just do parted /dev/sdc and repeat the steps).
  • Create your raid array with mdadm --create --verbose /dev/md0 --level=raid1 --raid-devices=2 /dev/sdb1 /dev/sdc1. This is creating a raid 1 with 2 drives.
  • Now format your new raid array with mkfs.ext3 -b 4096 /dev/md0. This will format with the ext3 file system but you could use whatever you wanted instead just as well.
  • Create a directory wherever you want for mounting the new raided disk. I used /var/raid_array. That would be cd var and mkdir raid_array.
  • Mount the new disk with mount /dev/md0 /var/raid_array. The raided disk should be usable now. Do a df -h to make sure it mounted and verify the usable space is what you expected.
  • Edit /etc/fstab and add a line containing /dev/md0 /var/raid_array auto defaults 0 1 to ensure the new disk gets mounted on reboot. You can obviously change those mount parameters to whatever values you prefer.
  • If you want, do some benchmarking and consider bumping up readahead on your new raid array. See my last post for information about that and how important it can be. BUT, if these are USB-connected external drives no need to waste time tuning. With software raid at least the OS knows there are two drives involved and so will tune up your readhead to 512. Just wanted to mention it as generally when setting up a raid array (or any decent hard drive) you want to be aware of that setting.

In the above I was assuming the external drives were the same size. If they are not then in the “mkpart” command use a fixed MB amount or smaller percentage for the second number instead of 100%. You could then separately partition any remaining space so it is at least usable if not in the raid.

The above uses raid 1 but mdadm can handle a lot of different configurations, just tweak the “–level” and “–raid-devices” parameters as needed.

At this point you have a raid array usable at /var/raid_array on the laptop “server”.

[4] Setup samba

Your new raid array wouldn’t be much use if you couldn’t easily access it. Use samba to set it up as a network share that all of your home computers can see or use. I’d recommend checking out the consistently awesome Arch Wiki for samba instructions. As with all things Arch, installation and configuration is very straight forward and cruft free. Your mileage may vary on other distributions.

[5] Finish up

From here you can setup whatever you like. Here is what I did.

  • Setup an iTunes share so my music on the network just shows up when I am at home. You can do this by setting up a daapd server, check out the wikipedia article about daap for an overview of what it is. There is unfortunately not an Arch Wiki article for this so here are the steps if interested.
    • Grab the needed software with pacman -Sy mt-daapd and pacman -Sy dbus.
    • Edit /etc/mt-daapd/mt-daapd.conf. Set the “mp3_dir” to the root location of your music and the “servername” to your server name. You will probably want to add more extensions to “extensions”. I added “.m4a,.m4p,.aac”.
    • Add dbus, avahi-daemon, and mt-daapd (in that specific order) to the DAEMONS array in rc.conf.
    • Start them up in that order with /etc/rc.d/dbus start, /etc/rc.d/avahi-daemon start, and /etc/rc.d/mt-daapd start.
    • That is it. Your music should show up in iTunes as a shared library with whatever name you set as “servername” in mt-daapd.conf.
  • Setup a backup script that pushes select important items up to s3 each week. I used s3sync to do this but you can use whatever you like. s3sync is straight forward and easy to use in command line scripts. Here is the s3sync site and here is a blog post that describes usage well.
  • Setup a basic monitoring script, particularly to watch the status of your raid. You can use mdadm --detail /dev/md0 to get the status of your drives and write a script to parse out the pieces you want. Since this is software raid with random external hard drives you aren’t going to get a nice red LED telling you when a drive is dead. Have it check each morning and email you if things are busted up. If you are sending email in your scripts do a pacman -Sy postfix first.

Conclusion

The above may look like a lot of steps but it really is more straight forward than it seems. With Arch especially, getting things setup is a snap. This is a cheap solution and isn’t going to be a screamer on performance but it offers a lot of advantages:

  • Cost $0 if you have an old laptop and a couple hard drives.
  • Even though its a cheap solution you still have complete features (raid, networked, iTunes share, plus whatever else you install on the server).
  • Can be easily expanded with additional drives, software raid in general offers a lot of flexibility and easy monitoring.
  • Laptops actually make decent home servers because they have batteries in them - power outtages aren’t a big deal. On the other hand, your external hard drives will still die so maybe not a huge advantage.
  • You have a actual machine on top of the drives so you can easily add whatever server software you want. You can use it as an SSH tunnel, web server, whatever.

Dell MD1120 + Perc6/E Performance

2009 May 13
by Joe

The Hardware

We recently ordered one of Dell’s MD1120 units and a Perc 6/E raid card with 512MB battery-backed cache to beef up our production database.

Dell’s raid controllers are rebranded models manufacturered by other companies and they have been hit or miss. They’ve done some horrible things (including advertising raid1-concatenated as raid10 a long while back) but my impression from reading online and from my own benchmarking are that these Perc6 cards are decent (but not exceptional). You still get the lockin aspect - Dell won’t support your machine if there is a non-Dell raid card in it and the MD1xxx units supposedly only connect to Perc5 and Perc6 cards.

The MD1120 itself is a pretty cool unit. Its only 2U and packs a lot of drives. We ordered one with 24x 73GB 15K SAS drives. No SSDs, I am amazed by SSD numbers but figure we can wait a few more years before shelling out the cash to fill an array with them. I want more data on their reliability in a 24×7 high IO server environment. Here’s a picture of the new guy.



  

The next time we have the need and budget to purchase a new database server from the ground up I plan to go whitebox but in this case we were looking for a relatively inexpensive way to get more capacity out of our existing Dell server and this seemed like a good option. I just had to ignore their storage tech guy who wanted me to buy a Gigabit SAN unit and screw our performance. So just ignore the tech guys that are part of the sales process and do your own research and benchmarking.

After benchmarking this Perc6+MD1120 combination extensively and putting it in production I am reasonably happy with its performance. Going to share those numbers now as it is sometimes hard to track down data on these things.

The Benchmark

Bunch of notes about the testing environment and configurations for anyone interested. If just want numbers skip past these.

  • Perc 6/E upgraded to latest 6.2.0-0013 firmware and connected to a new PowerEdge 1950 with 2x Xeon E5410s and 8GB RAM.
  • MD1120 connected directly to Perc 6/E.
  • All hardware raid configured with 64kb stripes, write back enabled, read ahead disabled (Dell hardware read ahead isn’t good).
  • Server running latest opensuse. Did this purely to make it easier to upgrade firmware, get Dell support etc. If you call Dell and are using opensuse just lie and say you are running Suse 10 - everything will work and they will never know the difference.
  • All tests were run 3 times and the middle run was recorded.
  • xfs mount options were just noatime and ext3 mount options were noatime,data=writeback.
  • xfs file system params were -b size=4096 -d su=64k,sw=X where X was the appropriate value for the configuration involved. ext3 params were -b 4096 -E stride=16,stripe-width=192.
  • dd params were “bs=8k, count=2000000″ ensuring a file 2 times size of RAM to bypass OS cache.
  • The bonnie++ random seeks/second is the most important number for DB performance.
  • I did a ton of tests with the first configuration and then settled into a groove of just testing the bits that seemed to matter. Hence the odd distribution of tests by config.

The dd and bonnie++ 1.02 results on opensuse

The distinct raid configurations are color coded and numbered. The winning individual tests are bolded.

Record# Raid Level Linux Params Results
Test Config HW SW Total File System Read Ahead Sched. dd Write MB/s dd Read MB/s bonnie++ seeks/s
1 1 24disk raid10 None 10 xfs 256 cfq 540 519 787.4
2 1 24disk raid10 None 10 xfs 256 noop 471 439 811.6
3 1 24disk raid10 None 10 xfs 256 deadline 494 429 812.1
4 1 24disk raid10 None 10 xfs 4096 cfq 544 836 802.7
5 1 24disk raid10 None 10 xfs 4096 noop 474 837 809.4
6 1 24disk raid10 None 10 xfs 4096 deadline 492 791 808.4
7 1 24disk raid10 None 10 xfs 8192 cfq 533 853 805.9
8 1 24disk raid10 None 10 xfs 16384 cfq 536 976 806.3
9 1 24disk raid10 None 10 xfs 32768 cfq 543 1035 808.6
10 1 24disk raid10 None 10 ext3 32768 cfq 332 602 695.2
11 1 24disk raid10 None 10 ext3 4096 cfq 339 929 743
12 1 24disk raid10 None 10 ext3 4096 noop 356 925 765.4
13 1 24disk raid10 None 10 ext3 4096 deadline 342 909 712.9
14 2 12disk raid10 None 10 xfs 4096 cfq 566 572 780.4
15 2 12disk raid10 None 10 xfs 4096 noop 561 567 788.9
16 2 12disk raid10 None 10 xfs 4096 deadline 552 571 786.6
17 2 12disk raid10 None 10 xfs 8192 cfq 566 623 778
18 3 2×12disk raid10 raid0 100 xfs 256 cfq 560 507 535.9
19 3 2×12disk raid10 raid0 100 xfs 4096 cfq 560 955 816
20 3 2×12disk raid10 raid0 100 xfs 8192 cfq 558 857 817.5
21 4 24disk raid6 None 6 xfs 256 cfq 436 478 415.9
22 4 24disk raid6 None 6 xfs 4096 cfq 440 1038 666
23 4 24disk raid6 None 6 xfs 8192 cfq 437 1054 670
24 4 24disk raid6 None 6 xfs 8192 noop 434 1058 651.7
25 4 24disk raid6 None 6 xfs 8192 deadline 435 1044 666.1
26 4 24disk raid6 None 6 xfs 16384 cfq 437 1083 667.3
27 5 24disk raid60 None 60 xfs 256 cfq 424 391 670.2
28 5 24disk raid60 None 60 xfs 4096 cfq 426 1038 669.6
29 5 24disk raid60 None 60 xfs 8192 cfq 424 1052 669.9
30 5 24disk raid60 None 60 xfs 16384 cfq 424 1082 657.5
31 6 3×8disk raid10 raid0 100 xfs 256 cfq 557 530 621.4
32 6 3×8disk raid10 raid0 100 xfs 4096 cfq 555 936 820.6
33 6 3×8disk raid10 raid0 100 xfs 8192 cfq 560 902 817.7
34 6 3×8disk raid10 raid0 100 xfs 16384 cfq 555 1041 815.5
35 7 24disk jbod raid10 10 xfs 256 cfq 367 573 817.5
36 7 24disk jbod raid10 10 xfs 4096 cfq 360 964 814.7
37 7 24disk jbod raid10 10 xfs 8192 cfq 358 994 816.3
38 7 24disk jbod raid10 10 xfs 16384 cfq 377 1049 818.7
39 8 12×2disk raid1 raid0 10 xfs 256 cfq 549 408 598
40 8 12×2disk raid1 raid0 10 xfs 4096 cfq 549 714 578.5
41 8 12×2disk raid1 raid0 10 xfs 8192 cfq 546 643 563.8
42 8 12×2disk raid1 raid0 10 xfs 16384 cfq 546 861 549.9
43 9 24disk jbod raid0 0 xfs 16384 cfq 743 1054 687.6
44 10 24disk raid0 None 0 xfs 16384 cfq 773 1094 671.8

  

Observations:

  • Raid10 is the best option. Winning configurations are pure hardware raid10 with loads of readahead (test #9) and software-striped raid 10 for “raid 100″ (tests #19 and #34).
  • I wasn’t impressed with raid6 or raid60 and raid0 isn’t a realistic option so that is why those setups aren’t as heavily hit in the above configurations.
  • The IO scheduler didn’t really make much difference. CFQ seemed to be just as good or better so stuck with it (its the default).
  • readahead makes a huge difference. Linux defaults this to 256 per drive and Linux sees a hardware raid array as 1 drive. You absolutely must increase that 256 default value to at least 4096 in my opinion. I increased it as high as 32768 for the pure raid10 config and performance didn’t suffer in the seeks/sec or as reported by pg_bench while sequential read speeds increased dramatically.
  • xfs is faster than ext3 in all tests I compared them in. Included just a few ext3 numbers above. But, don’t use XFS (or ext3 with data=writeback) unless you have both a battery-backed cache on your raid controller and are connected to a UPS for main power (and ideally you should be monitoring the health of the BBU on the raid controller to ensure the battery isn’t dead). You could lose data if this advice is ignored.

The bonnie++ 1.03e results on Arch Linux

Next I installed Arch Linux which happens to come with bonnie++ 1.03e. I did a couple bonnie++ runs just to make sure the new OS didn’t mess anything up and was shocked to see dramatically better random seeks/second numbers. Sequential speeds were virtually the same, but seeks/second was massively faster. Here is a table showing some of the configs (I didn’t retest them all - was running out of time) with bonnie++ 1.03e seeks/second numbers. I am going to credit the newer version of bonnie++ for this difference. I retested enough configurations to feel good about the trends seen when testing with opensuse and bonnie++ 1.02 still holding. Open to hearing other possible explanations for the performance increase. I was glad to see these numbers as I was disappointed with the apparent 800ish ceiling I was seeing in the first batch of tests.

Here are the configs I retested (middle of 3 runs again). The Test# matches the row in the above table. The new 1.03e score is listed next to the old 1.02 score.

Test# Config# bonnie++ 1.02 seeks/sec bonnie++ 1.03e seeks/sec
1 1 787.4 1613
4 1 802.7 1652
5 1 809.4 1639
6 1 808.4 1688
7 1 805.9 1684
8 1 806.3 1697
9 1 808.6 1717
19 3 816 1662
26 4 667.3 1056
30 5 657.5 1168
34 6 815.5 1705
38 7 818.7 1560
44 10 671.8 1175

  

Observations:

Higher numbers across the board. No big new insights. Raid 10 still wins, raid 6/60 still substantially slower. At this point that pure raid 10 config (test #9) that scored a 1717 is looking pretty nice. The pure software raid 10 (test #38) fell behind the hardware version further. Biggest take away from this is to be absolutely certain that when you are benchmarking disks the OS and tools are identical.

The pgbench results

Finally, I took the 3 fastest configurations and did pgbench runs with those. I was running out of time so I took what looked like the winner (test #9) and additionally tuned readahead and schedules a bit to ensure I got the best combination. pgbench isn’t perfect and there are people who dislike it but it gives us another number to compare and consider along with the raw dd/bonnie++ numbers already known. Keep in mind I am using a PostgreSQL install and pgbench on this new server - not the actual production server. Doing the benchmarks on the actual final server just wasn’t an option. The only significant difference between the benchmark server and the production server is that the latter has 4 times the RAM and has had nontrivial postgresql.conf tuning so I can only assume these numbers would improve a good bit.

I just tweaked a few things in the postgresql.conf file on the benchmark server. The non default values:

  • shared_buffers = 2048MB
  • checkpoint_segments = 10
  • effective_cache_size = 4096MB
  • max_connections = 500
  • work_mem = 20MB
  • maintenance_work_mem = 128MB
  • synchronous_commits = off
  • random_page_cost = 2.0

Note that I did have to increase SHMMAX to get shared_buffers + connections that high. See my PostgreSQL setup post for more information about that.

Between each run I dropped the “test” database and recreated it. I initialized with these commands. Note the different scale factors to get nontrivial data amounts - scale factor 1000 is pretty large but was necessary before I saw the disks working constantly during the benchmarks.

pgbench -i -s 100 -U postgres -d test
pgbench -i -s 1000 -U postgres -d test

Then I ran the tests specifying 40 clients and 10,000 transactions per client in the params. Read more about the pgbench tool for the test differences and what the transactions involve.

Here are the pgbench results. Numbers are transactions/second. Values listed are again the middle of 3 runs. I would have liked to do these pgbench runs on more configs but I again was running out of time and these suckers took a long time to run with all the dropping and re-initializing.

Test# Config# TPC-B s=100 SELECT s=100 TPC-B s=1000 SELECT s=1000
4 1 1580 10885 1224 3655
5 1 1591 10667 1270 3499
6 1 1567 10647 1267 3048
7 1 1551 10656 1202 3478
8 1 1553 10644 1209 3386
9 1 1606 10759 1296 3548
19 3 1581 10743 1311 3269
34 6 1563 10677 1323 3156

  

Observations:

These pgbench tests were mostly a wash. But that isn’t a big surprise considering I only compared the 3 best configurations that were close in the dd/bonnie++ tests. Really wish there had been time to get some other configs in that table to see the difference. The #9 config did well with a top 3 number in all 4 tests.

Conclusion

I went with setup #9. Pure hardware raid 10. It won almost all tests (with the striped raid 10s being the only real competitors) and pure hardware raid is super easy to configure, maintain, and monitor.

More generally this MD1120 performs pretty well, especially for the relatively low price. As a quick note on price buying one of these MD1120s as configured above along with a PowerEdge 1950 with 16GB RAM and a Perc6/E card to connect them would cost less than $15k (about $540/mo for a $1 36mo lease) depending on the configuration and the deal you got. You could probably even shave a few thousand off that number if you got a real good deal.

If you do buy stuff from Dell be sure to get in touch with a small business sales team. They can offer nontrivial discounts on the price you can finagle in their online shopping cart, you get to talk to the same people every time you place an order, it gives you a contact if you have questions or run into issues with technical support, and it just generally is the way to go.

This MD1120+Perc6/E combo is connected to our existing DB server now and all I will say is performance is excellent. I am seeing zero query backup, barely any IO wait reported by vmstat, and hugely impressive random IO spikes when load gets heavy (though we haven’t gotten close to maxing out so who knows how high it could go). And, thanks to all this testing, I feel really good about having the optimal configuration for the hardware out there doing work.

Bulk Data Loading With PostgreSQL

2009 April 18
by Joe

Bulk data loading is usually a niche task - something that doesn’t have to be done too often but when it does it feels painful because it takes so long and if you mess something up it has to be done again. After loading dozens of 7 and 8 digit record count files into PostgreSQL databases I think it would be helpful to at least share the basic stuff that can make the process go a lot faster. The PostgreSQL documentation of course has everything you need for bulk data loading if you would rather read the official docs here but I consistently hear cases where the advice there isn’t used so I wanted to show some numbers. That link covers more techniques, all this post really adds is some numbers to back up the basics.

For this post I am working with 600k records being loaded into a table that looks like this. The data plus delimiters is about 60MB.

                 Table "public.people"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 id          | character varying(36)       | not null
 first_name  | character varying(255)      |
 last_name   | character varying(255)      |
 middle_name | character varying(255)      |
 client_id   | character varying(36)       |
 birth_date  | timestamp without time zone |
Indexes:
    "people_pkey" PRIMARY KEY, btree (id)
    "idx_people_client_id" btree (client_id)
    "idx_people_name_search" gin (to_tsvector('english'::regconfig,
      (((COALESCE(first_name,
      ''::character varying)::text || ' '::text) ||
      COALESCE(middle_name, ''::character varying)::text) || ' '::text) ||
      COALESCE(last_name, ''::character varying)::text))

A few points on this particular table and config:

  • The data is a subset of a real production table made up of entries people provided on themselves so the distribution is realistic and balanced.
  • The primary key id column is a UUID generated by code.
  • A very basic index on the client_id column.
  • A more complex expression gin index on the name fields to allow full text searching. Full text searching was added in PostgreSQL 8.3 and it’s pretty powerful. Read more about that here if interested.
  • Overall it is a relatively small, simple table but not trivially so and its large enough to get decent timings from bulk load operations for comparison purposes.
  • Performing tests against a clean DB, empty table, and freshly created indexes so we have a best case scenario.

So, let’s say you have that empty table and want to load 600k records in. Here is a sequence of methods going from worst to best. I am doing a VACUUM ANALZYE before and after each operation. All timings rounded up to nearest second. We are comparing INSERT and COPY.

First Setup: leave indexes and pkey in place.

INSERT: 6 minutes, 5 seconds

COPY: 3 minutes, 11 seconds.

Second Setup: drop/recreate indexes, don’t touch pkey

When timed it took 1 second to drop indexes and 13 seconds to rebuild them. I have added that overhead into these numbers.

INSERT: 3 minutes, 19 seconds

COPY: 1 minute, 5 seconds

Third Setup: drop/recreate indexes and pkey

When timed it took 1 second to drop indexes and 15 seconds to rebuild them (includes pkey). I have added that overhead into these numbers.

INSERT: 2 minutes, 49 seconds

COPY: 21 seconds

Conclusion

With that final setup, dropping/recreating all indexes and using the COPY command, I was able to load 600k records in 21 seconds. That includes the drop/recreate overhead. The actual COPY command only took 5 seconds.

General notes related to the tests done above:

  • The COPY command is magical. It completely blows away INSERT for bulk load performance.
  • Dropping and recreating indexes is far faster than leaving indexes in place and forcing thousands or millions of individual updates to them.
  • Do a VACUUM ANALYZE before and after bulk loading data to be safe. If you dropped indexes be sure to recreate them before doing the ANALYZE.
  • If you don’t think the speed improvement shown is significant keep in mind we are working with a narrow 6 column table, with just 3 indexes, and not very much data. The savings found by dropping and recreating indexes becomes enormous as your table height and width increases and/or your number of indexes goes up. Similarly if you are only working with thousands or low tens of thousands of rows you can just do whatever is easiest since the size is so trivial.

Some other notes worth mentioning:

  • The same rules apply for updates. ALL indexes slow down updates, not just the ones on the column you are updating so if you need to update an entire table dropping and recreating the indexes will provide a massive speed increase.
  • If you are doing a bulk update and using a field on the updated table for lookups you are generally better off dropping all indexes except the one on your lookup column. Say for each person in my people table I am looking up the specific person by id and setting their client_id. I would want to leave the index/pkey on the id column so those lookups were fast. You still pay a penalty at write time because the index/pkey must be updated but not having the index would mean a sequential scan is done on the entire table for each update.
  • Though incredibly fast the COPY command is very picky about its input. It is an all or nothing operation and if there is even a slight issue with your data it will fail. You will want to VACUUM your table after a failed COPY to recover space. Thankfully if it does fail it prints out the exact line number and reason.
  • When specifying a file for the COPY command you must use an absolute path and the file (and all directories in its path) must be accessible by the postgres user because it is the server that goes and reads the file not the client. I find it easiest to just toss my input files into /tmp.
  • Consider partitioning your data to allow drop/add of indexes on certain partitions preventing an entire table from having awful read performance while data is being loaded. Partitioning is a whole separate topic but read the docs, be sure constraint_exclusion is on in your postgresql.conf, and be sure your check constraint is in the where clause of all queries if you take this route. Partitions and significant bulk data loading fit together really really well.
  • Seriously read the PostgreSQL documentation on this stuff if you are bulk loading data. Aside from the additional SQL-level recommendations it covers the tuning options for your DB that can make a very significant difference.

Finally, you may find that expression index for full text searching in my people table above looks strange. I hope to do a post on indexing specifically that explains in detail. In my opinion tuning database indexes is a blast and made even greater by the amazing capabilities of PostgreSQL.

Connection Pooling: PgBouncer or JDBC

2009 April 4
by Joe

Why Use Connection Pooling

With recent customer additions we have had to deal with lots of scaling challenges the last few months. One puny fraction of that effort has been around tuning our database connection pools. In very short, if your database server is overwhelmed with concurrent connections it can get bogged down in context switching and resource/io waiting such that overall throughput or transactions per second can drop. In the best case things just get slow. In the worst case things completely lock up because the server just can’t service all of those connections or your client code starts throwing exceptions because the database has reached it maximum configured connection count.

One way to control the maximum number of connections hitting your database at once is via connection pooling. This can ensure a consistent load that your database can dispatch quickly to keep a better overall throughput going. You can also match your aggregate pool sizes to the maximum connection count configured in your database server to ensure your client code never asks for more than the DB can give. The other HUGE advantage to connection pooling is that you are essentially caching the overhead of obtaining new connections. To the database there is a set number of connections made and held open - the pooling software then distributes those pooled connections to clients that need them and when the clients are done the connection remains open and waiting to be used again. This avoids a surprisingly high amount of overhead.

There is definitely a trade off. If your application needs to have a certain number of connections simply because you have too many concurrent users it is time to tune your code/queries/indexes (HUGE gains can always be found there), push stuff into distributed caches, or buy new database hardware so you can service more connections. There are recommendations online of a connection count as low as 2-4 per core. We’ve settled in around 30 per core and that seems to be a sweet spot for stability and performance. With 8 cores we can sustain well over 1000 transactions per second on a single PostgreSQL server (running on decent but not exceptional hardware).

We use JDBC connection pooling with c3p0. Each of our application servers has its own pool. We ensure that the sum of those pool sizes matches the connections that the database allows and also equals our desired connections per core count.

We explored using PgBouncer instead. This differs in that the connection pooling happens on the server - clients just ask for as many connections as they need talking to PgBouncer instead of directly to PostgreSQL.

You cannot use both JDBC pooling and PgBouncer or other server-side pooling. Or more correctly you can but there is no point. Say you set PgBouncer to maintain a pool of 500 connections. If you have JDBC pooling in the mix that just means you have to tune your individual client JDBC pools such that the sum of their pools doesn’t exceed 500. If that is the case, why use PgBouncer? Just use PostgreSQL and set it to handle 500 connections. Instead you would drop JDBC pooling and just have your clients ask for whatever they wanted while PgBouncer did the pooling and connection management.

Comparison

In deciding to move to PgBouncer or stick with c3p0 I ran a few tests. Here are those results. The goal here was to compare the overhead of obtaining connections not to determine the overall best solution by throughput. That would require significantly different and more involved tests. We were considering a move to PgBouncer and just wanted to consider any difference in connection overhead. These tests were done with a pool size of 50 and with client code connecting to a separate database server on the same subnet and physical switch.

First Test
Ran 500 queries with each query being a simple select 1. This simple query means most of the time is connection overhead so its a good test to decide between pooling solutions.

c3p0 ComboPooledDataSource .99 s
JDBC no pooling and no PgBouncer 6.41 s
JDBC no pooling but with PgBouncer 3.99 s

Second Test
Ran 500 queries where the query is selected sequentially from an array of 5 distinct queries varying from very simple to very complex. Highly imperfect test, but trying to see how connection overhead factors into a more normal load.

c3p0 ComboPooledDataSource 122.18 s
JDBC no pooling no PgBouncer 139.71 s
JDBC no pooling but with PgBouncer 129.81 s

So c3p0 definitely offers the fastest solution. Additionally PgBouncer definitely makes a difference over not using connection pooling (and thus having to obtain a new connection with each query). This makes some sense as c3p0 is closer to the client code and keeps the pooled connections local. With PgBouncer the clients have to hit the network to queue up with the server.

Conclusion

We stuck with c3p0. But in my opinion where PgBouncer really shines is on the administrative side. As we continue to add more servers a move to PgBouncer or something similar seems likely. It (and PgPool-II, another solution) offer other capabilities as well. PgPool-II even offers built-in load balancing of reads against replicated copies of a DB.

Say you have 10 application servers and 1 database server configured for 500 maximum connections. You set the individual application server pool sizes to 50. Now you need to add a new application server. You would have to tune all 10 of your old servers dropping their pool sizes down to make room for the new one. This problem gets more and more severe as you add more servers.

Also, tuning individual pool sizes per client means there is a chance that one server is starving while others have pooled connections sitting idle - the connections aren’t shared. Probably not a big issue if you are load balancing user connections to your application servers evenly but still a possibility.

As you get more and more machines the server-layer pooling with PgBouncer definitely makes more sense. It also is obviously client language neutral so if you are running in an environment that doesn’t have mature pooling solutions its a great option.

OCZ Vertex SSD in a 17″ Santa Rosa MacBook Pro

2009 March 26
tags: , , ,
by Joe

My Reasons for Buying an SSD

I do all of my work on a MacBook Pro. It is a maxed out 17″ of the Santa Rosa variety - this one with the 2.4 Ghz CPU, 4GB RAM, and the higher res display. It was top of the line when purchased and still has pretty solid specs.

Lately I have been noticing it beach balling a lot, along with very frequent full-bore fan speed and noise. Additionally my IO read/write indicators (via menumeters) are constantly lit up. When I first got this laptop I do not remember any of that - it ran very silent and I barely ever noticed lag or stuttering.

I had originally thought this machine had the 7200rpm 160GB drive of the mid ‘07 revision but despite threads like this one confusing things, I am pretty sure the disk was actually the 5400rpm. The xbench comparison pages revealed that the disk tests were coming back with an aggregate score of high 30’s to low 40’s for people running Santa Rosas with the 5400 rpm drive and as you can see in the below benchmarks mine wasn’t even on par with those scores. As far as I could tell from Xbench I was running a 4200 rpm drive or a 5400 rpm drive that was dying with a score in the 20’s

My frustration with the Macbook’s performance happened to correspond in timing with a lot of discussion at the office about our production database IO performance, SAS drives, better controllers, etc. This turned into conversation about Solid State Disks (SSDs), the Intel X25s, and even the incredible Fusion-IO drives. For anyone interested the X25-E and Fusion-IO are compared here.

So I made up my mind to buy an SSD for this Macbook Pro. A tax return was coming, it was close to my birthday, so why not right? I had my mind set on the Intel X25-M 80GB and then Zach pointed me at this incredibly detailed review/overview of SSDs by AnandTech.com. That article is excellent and you should read it if in the market for an SSD. The big takeaways are:

  • Random IO plays a huge part in the performance of a computer under a typical work day load. Sequential IO (the speed advertised on the box) is far less important.
  • Many SSDs, notably all using a JMicron controller, have horrible random write performance - so bad that they are actually a step down from a standard disk.
  • Intel’s SSDs do not suffer from this issue and they perform the best of all SSDs available.
  • OCZ has made a few of the JMicron-based SSDs that have the terrible random write performance but their newest SSDs, including the OCZ Vertex, do not suffer from this issue and they perform well.
  • SSDs get slower once you have written to all available space at least once, but are still far faster than a standard disk once this happens.

The main gist is that an SSD is a very noticeable, solid upgrade. The OCZ Vertex does not perform quite as well as the Intel X25s (wins some, but loses more benchmarks in that AnandTech article) but you pay significantly less per GB. The Intel X25-M 80GB was about the same price as the OCZ Vertex 120GB (low $300s) when I made my purchase. The baseline jump from regular HD to SSD is huge so the incremental difference on top of that between the Vertex and the X25-M just didn’t seem like a big deal.

Preparation

To prepare for the OCZ Vertex I shrunk down my disk space usage. I was using 100GB and my initial direction towards the Intel disk meant I had to lose a significant portion of that. The OCZ eliminated this need but from their forums and the above AnandTech article leaving space empty is a good idea as it delays the point at which you have written over the entire drive.

I used a combination of Monolingual and Disk inventory X to achieve this. The first can delete languages and architectures you don’t care about - for me it freed up 4.2GB of space. The second provides a very useful visual display of your hard drive and makes it easy to identify and eliminate the big space killers. I found an unused parallels image, old irrelevant zip archives, unused applications and their data/caches (an old version of IntelliJ IDEA had a 3GB cache on disk, and a long ago uninstalled copy of MS Office left a couple GB as well). Unless you care about the speech capabilities of your laptop, you can also kill the Alex voice at /System/Library/Speech/Voices/Alex.SpeechVoice to net another ~700MB.

In total I got my usage down to 63GB and that includes 26GB of music I could offload to something else if I felt like it. Can’t recommend those two programs enough.

Installation

Despite being very comfortable swapping out computer hardware I was initially hesitant to do the install myself. This is because these particular MBPs are nontrivial to swap the hard drive of. But, after being told by Amro that it wasn’t too bad and also really wanting to get the swap done I drove to Home Depot, bought a Torx screwdriver set (T6 is what you need along with a tiny phillips) and did end up putting the drive in myself. It took me ~40 minutes but I was working slowly and carefully to ensure I didn’t jack anything up. I printed up the PDF of these instructions and they were complete and clear. My notes from this process:

  • In Step 10, the ribbon cable was taped down pretty good. I’d recommend loosening the tape and disconnecting the cable with tape still attached if you can.
  • In Step 15 the SATA connector had a couple pieces of tape around it that wrapped up underneath the drive.
  • I used a cheap 2.5″ disk enclosure and SuperDuper to copy my existing drive to the OCZ after formatting it with Disk Utility. This worked flawlessly. After the copy I just swapped the drives, turned on my machine, and I was at 100% again - everything exactly as it was.
  • Go slow and keep track of the screws you remove by clearly labeling them on a sheet of paper. You have to remove (and replace) about 30 screws.
  • From reading around, and ignore this, the process doesn’t void your warranty (again, what I have read, not what I am saying). But, if you take your laptop to the Apple store and they see a non-spec hard drive or any damage they can refuse to do work on it so be prepared to put your old drive back if you need to get it serviced and be very careful so there isn’t any way for them to tell you were in the case. If you are not comfortable doing it yourself, I’d suggest Atlanta Pro Audio - reasonable price and they can do it while you wait. They were recommended and used by multiple coworkers.

Here’s a shot of the laptop guts with it all opened up, also the point where it is too late to turn back.

Now on to the actual numbers and whole reason for this post.

Performance Notes

  • The data on my hard drive was unchanged between non-SSD and SSD tests - exact same content.
  • In the “launch apps” tests below I am loading my minimal stack of daily use applications - IntelliJ IDEA, Textmate, Terminal, TaskPaper, Mail, iCal, PostgreSQL 8.3 server, Firefox, Tweetdeck, Adium, iTunes, and two Fluid apps.
  • I launched the applications via Spotlight in the same order as quickly as lag/Spotlight allowed. Then checked on the biggest apps until fully working. IntelliJ IDEA dominates this time as it by far takes the longest to fully load (it loads up with our huge Java project setup by default). The timing is highly subjective, but this is the period of time that passed between login and when I felt all applications were responsive and usable.
  • Xbench is available at Xbench.com. You can view posted results from other people there too. My Xbench runs were done after a clean reboot and with nothing else running.
  • I had planned to do dd and bonnie++ tests as well - actually did run them on the old drive but after reading the OCZ forums and considering that I had to write a total of 16 GB to disk to run those tests to get real results I decided it wasn’t worth chewing on the SSD just for a benchmark. Once the SSD has proven stable for a longer period of time i’ll run those tests to compare. For now I just have Xbench. I don’t know if it is accurate at all (compared to dd/bonnie++) but it is commonly used and provides a way to do comparison. I got different results with Xbench on every single run, used middle of the pack runs for numbers below..

Performance Before SSD Install

Bootup results
Time to boot up to login screen: 83 seconds
Time to login and launch all apps: 526 seconds

Xbench results
Total system score (all tests): 99.01
Link to all pre-SSD results

The disk-specific numbers:

Performance After SSD Install

Bootup results
Time to boot up to login screen: 24 seconds
Time to login and launch all apps: 64 seconds (WOW - compare to above)

Xbench results
Total system score (all tests): 185.23
Link to all results

The disk-specific numbers:

In addition to an utterly amazing performance increase my laptop is completely silent now. As I type this post I can hear absolutely nothing. Without menumeters telling me when reads/writes are happening I would never know the disk was doing work.

Conclusion

The OCZ Vertex makes this laptop an entirely new machine. It makes a HUGE difference. Best money I have ever spent on a computer upgrade. These things, in my opinion, are the biggest advance in computing hardware to come along in quite awhile. And over the next few years they will get faster, bigger, and cheaper. Prices have literally dropped in half in just the last few months so if you aren’t in dire need of a new disk it is probably best to wait longer.

The biggest, most noticeable part of running with the SSD installed is boot time and application launch/quit time. Applications load instantly. I used to open up the Applications folder in Finder when I rebooted because Spotlight would lag and that way I could start all my apps from one place (I don’t keep anything in the dock - only shows running programs). With the SSD, Spotlight is functional immediately and when I tried to start everything from the Finder I could not because the Apps fired up so dang fast I could not keep my cursor in the Finder window.

I have only been running with the SSD in for one day so that is the context this post is written in. There are scary posts in the OCZ forums of people (including several OSX users) having really rough issues with this SSD in - freezes, data corruption, etc. Notably it appears people have increased difficulty with sleeping their machines, using Bootcamp, or running Windows VMs. I don’t use Bootcamp but can say I have had zero issues with sleeping/waking or running Windows VMs in Parallels (the VMs load wickedly fast now). And to their credit the OCZ guys are tenacious in response, appear to be completely willing to assist and replace, and definitely seem to be standing by their drives. I suspect it is a minority of users having issues but just take it as a warning. Reading those forums for awhile made me really nervous about my purchase (and was a contributing factor to swapping the drive myself - so I can get it out if need be). I do thorough backups - hopefully I won’t need them. After using the laptop for a full day with my normal, heavy load I feel a lot better about it being stable.

Also, the OCZ forums include a lot of people tuning their partitions, flashing various firmware versions, and tweaking settings heavily to get more performance. I intentionally did none of that. I just copied my stuff to the drive and swapped it in so take this as a naive but straight forward and realistic install. My OCZ Vertex is running firmware 1199. There is a newer version out but as far as I can tell you need a PC to update it. I’ll only worry about that if I have issues.

So overall - very happy. My laptop is a lot faster. Biggest upgrade (by feel) I can ever remember doing to a computer. Today was the first day in a long time where I had virtually no hesitation or beach balling from the MBP.

Monitoring Dell Perc5 and Perc6 Disks in Arch Linux

2009 March 11
by Joe

One of the downsides to hardware raid is that it is not as easy to monitor as software raid. Monitoring individual disk status requires proprietary software made to match the hardware. This is the position you will be in if buying Dell equipment with their Perc5/Perc6 controllers. The reason is that to the OS your big raid array is just a single big disk - the hardware controller masks knowledge of the individual disks and their status. You can monitor the Dell disks though, you just need the matching software. This will work for probably any Linux distribution and I suspect the earlier Perc controllers as well.

It is definitely a good idea to monitor your individual disks, else you could have one fail and not even know your raid array is operating in a degraded state. When a disk fails you want to act quickly (and preferably have hot spares configured in your controller) because disks manufactured in the same batches are rumored to frequently fail around the same time. Though I have not experienced this personally, it seems plausible.

The first option you might happen upon is installing Dell’s Open Manage software. That stuff is pretty bloated and you have an adventure ahead of you in getting that to install if you aren’t running Redhat or SuSE (or perhaps Debian for which a repacked set of .debs seems to popup quickly after a new version is released).

The other option and one I will shoot through here is using the Megaraid CLI tool from LSI. The Perc controllers in the Dells apparently are rebranded LSI controllers so you can use this command line tool to extract the good information for monitoring purposes. Here’s the steps to installing it.

Installation

  1. Grab the Megaraid CLI program from LSI. At the time of this post you can get it at http://www.lsi.com/DistributionSystem/AssetDocument/4.00.11_Linux_MegaCLI.zip.
  2. Put the downloaded .zip file on your server somewhere in a temporary location and unzip it.
  3. Unzip the zip file that is unpacked (guess the double zip is for good luck?)
  4. Install rpmextract. That is pacman -Sy rpmextract on Arch Linux.
  5. Unpack the .rpm that was in the innermost .zip file with /usr/bin/rpmextract.sh MegaCLI-4.0.0.11-1.i386.rpm
  6. mv the resulting MegaRAID directory to /opt/MegaRAID.

Note that if you are running one of those fancy rpm-based Linux distributions you will obviously skip the rpmextract part :)

That will do it for the install. Now on to Usage.

Usage

You can run /opt/MegaRAID/MegaCli/MegaCli64 -h to see all the available options. There are a ton and that help output is completely not useful at all. This tool appears to be able to both query the status of disks as well as perform operations against them. I’ve only used it for monitoring and really only use this one command:

/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllnfo -aAll

This will generate a ton of output on each of the compatible controllers in your server. If you have just a server (that isn’t attached to extra storage) you will likely only have 1 controller. If you have an MD1000/3000 or other direct attached storage connected via an extra Perc adapter you could have multiple controllers.

In all of that mess of output there will be a “Device Present” section for each controller. In that section you will see output like this:

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 16
  Disks           : 15
  Critical Disks  : 0
  Failed Disks    : 0

You can see 1 virtual drive listed - this is the big single drive the OS sees. You can also see its status there - Degraded is thankfully 0. If it was greater than 0 it would mean your “Virtual Drive” is degraded likely meaning a disk has dropped out. I suppose Offline would mean your array is fried due to multiple disk failure or isn’t being used at all.

Additionally you can see the physical drives and their status. Thankfully none of mine are Critical or Failed. I can’t say I understand why it says there are 16 drives when there are only 15 unless it is recounting the virtual drive for some reason.

In any case, you can see that by checking those counts you can pick up whether your raid array is operating in a degraded state. Here is a command that grabs just the “Degraded” number from the “Virtual Drives” section:

./MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -a0 | grep "Virtual Drives" -A 1 | awk 'END {print $3}'

Even better, here is one that sums up the “Degraded” number for all controllers. This one is more flexible as it is equally useful on a server with a bunch of different controllers and raid arrays:

./MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll -NoLog | grep -A 2 "Virtual Drives" | awk '/Degraded/ {TOTAL += $3} END {print TOTAL}'

The output of that last command is a single integer. That would be easy to snap into your monitoring software of choice - my preference is zabbix so I would add this to my /etc/zabbix/zabbix_agentd.conf file:

UserParameter=custom.megaraid.degradedCount,/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aAll -NoLog | grep -A 2 'Virtual Drives' | awk '/Degraded/ {TOTAL += $3} END {print TOTAL}'

Now bounce your zabbix agent, add the “custom.megaraid.degradedCount” as an item to your server(s) in the web interface and you are set.

This works really well. I tested it out by yanking disks out of running servers and watching their degraded count jump up. You may want to do the same just to ensure it is working end-to-end but don’t say I told you to do it and don’t yank a disk out of an array that can’t handle multiple disk failures. If you do and another disk actually fails while your test disk rebuilds you will be in rough shape.

PostgreSQL Setup Basics

2009 March 8
by Joe

There is a seemingly endless amount of information you can learn about PostgreSQL usage, administration, and tuning. The amount of information out there can be overwhelming but just learning about a small piece of it can allow you to get dramatically better performance. This post is an effort to provide something you can read quickly and get some benefit from. This is purely my opinion, you should spend some time reading the excellent PostgreSQL documentation, reviewing proper tuning guides, and reading the PostgreSQL performance list for more information. I’ve tried in this post to keep things at a high level and omitted a lot of details to try and make it shorter and easier to read.

On all of the below the PostgreSQL mailing lists are invaluable. You can ask questions about your hardware, queries, performance, whatever and get utterly helpful and complete information from incredibly knowledgeable people many of whom are the developers behind PostgreSQL.

Hardware

PostgreSQL scales up beautifully as you feed it increasingly powerful hardware and can handle enormous concurrent load. Here is how hardware components rank in importance.

  1. Disk IO is king. Especially if your database does not fit in memory. You want a lot of spindles (individual disks), a good raid controller with a battery-backed cache, and raid 10 is recommended almost universally. Fancy CPUs and lots of RAM are a complete waste of money if your disk setup is not up to snuff. Faster individual disks are obviously better too.
  2. Memory is next important. Almost as important as disk IO if your database fits completely in memory but that isn’t a condition that sticks around forever or that you can normally rely on. The more memory the better. Several of the configuration parameters in postgresql.conf are direct percentages of total system memory.
  3. CPU is still important, especially if your database is doing a lot of number crunching and reporting/complex queries, but isn’t going to be your bottleneck in most cases. 8 cores is easy to achieve now, just buy whichever Xeon CPU hits the price/performance sweet spot. Don’t pay huge for bleeding edge GHz.

Mainstream vendors should be sufficient for all but massive databases. Our database holds hundreds of gigs of data and does thousands of transactions a second and it runs on plain Dell equipment (2950 linked to a MD1000, 20 disks total).

Obviously if you are looking to purchase a server for running PostgreSQL do more research than this post before pulling the trigger. Sometimes you can even get a vendor to let you borrow a unit for a short period of time to benchmark on if it is sufficiently expensive.

Disk Configuration

I strongly recommend running PostgreSQL on its own, dedicated server unless you are expecting trivial load. When you setup your partitions during the OS installation you should be mindful of your disk spindles. You want to have disks dedicated to doing nothing except database writing and reading. Here’s a list of a few things to think about in terms of IO usage:

  • Your operating system.
  • The actual PostgreSQL data that is written/read.
  • PostgreSQL logging (lots of log types - transaction, standard, write-ahead).

Considering the above, try to have them each on separate disk arrays. If that isn’t possible, try to at least separate the OS from PostgreSQL. The standard logs for PostgreSQL can be tuned and configured in postgresql.conf. The transaction logs cannot but they get written to $PGROOT/pg_xlog. You can use a symlink to point pg_xlog wherever you want, just be sure to shut PostgreSQL down before making that change and preserve any existing files in there in the move.

Lets say you buy a six disk server to run your database on. You could setup two in a raid 1 for your OS and the remaining four in a raid 10 for all your PostgreSQL stuff. More disks gives more flexibility and more performance but we ran on that smaller configuration for awhile and it worked pretty well. Consider going with 2.5″ drives as you can get more disks in the same server chassis.

A final note on this, PostgreSQL supports the notion of tablespaces. Tablespaces can be mapped to separate physical disk locations allowing you to put certain tables or indexes on different physical disk arrays. This can be useful for splitting up the IO load of the server or for doing things like moving heavily hit indexes or tables to a faster more expensive set of disks (that perhaps would have been too expensive to make large enough for the entire DB). We do this. We have a bigger raid 10 of slower SATA drives for most of our data and then a smaller raid 10 of 15K SAS drives for a few of the hard hit items.

Installation

In Arch Linux installing PostgreSQL (via pacman -Sy postgresql) creates an /etc/rc.d/postgresql script for starting and stopping the database. The first time you “start” that script it calls initdb if the configured $PGROOT does not exist. Subsequent times it simply does a pg_ctl start. You will get something different but similar (probably /etc/init.d/postgresql) in other distributions.

Once your database as been initialized all of the important pieces of your PostgreSQL installation (aside from the command line tools) will be underneath the $PGROOT. This includes the actual data stored in your DB, your postgresql.conf file, directories for the various PostgreSQL logs, and a few other configuration files.

Several command line tools get put in /usr/bin/. Here are a few of them and their purpose. The PostgreSQL documentation has in depth instructions and information on each of these as well as the ones I am omitting.

  • pg_ctl - This is what gets used to actually start and stop PostgreSQL. It is often not called directly because an init script uses it internally.
  • initdb - Is used one time to initialize the $PGROOT for your PostgreSQL install. Can be directed to initialize in a specific location and you can control the locale used (I recommend going with UTF-8) with its various parameters. This one also generally won’t be called directly but by an init script in your Linux distribution. If you need to call it yourself the most common command (for me) is /usr/bin/initdb -D [MyPGROOT] -E UTF8 –locale en_US.UTF-8.
  • pg_dump - Tool for dumping a specific database in your PostgreSQL install. This one dumps only a single database and no global/cluster settings like tablespaces or users.
  • pg_dumpall - Tool for dumping your entire PostgreSQL install to a file. Includes users, tablespaces, permissions, everything. Is best choice if dumping for the purpose of moving an entire install to a new machine or to a new version of PostgreSQL.
  • pg_restore - Tool for restoring a file produced by pg_dumpall or pg_dump.
  • vacuumdb - Tool for doing a vacuum (and optionally an analyze with the –analyze parameter) from the command line.
  • Database Configuration

    The tuning guide linked at the top of this post is an awesome walkthrough of the various parameters so there is no point in mentioning anything about that here. Go read that.

    Some settings can be changed and reloaded without restarting your DB while others cannot. The postgresql.conf file itself notes which ones require a restart. For the ones that do not require a restart you can make the changes and then use pg_ctl to reload the config like this:

    /usr/bin/pg_ctl reload -D /path/to/$PGROOT

    If it does require restart you will need to completely bounce PostgreSQL. In Arch Linux this is simply /etc/rc.d/postgresql restart and you will do something similar in other distributions.

    Also, the combination of your shared_buffers setting and your max_connections setting may require you to make a change to the shmmax setting of your server’s kernel params. Basically it is the largest chunk of shared memory an application can grab at once. PostgreSQL will print out a clear error at startup if this parameter is set to an insufficient level but doesn’t give any pointer on how to change it.

    To change it do a cat /proc/sys/kernel/shmmax to check the current value. The value is in bytes. If it is too small you can edit /etc/sysctl.conf and add a line at the bottom containing kernel.shmmax = myNewValue. Then run sysctl -p and you should be good. I am no expert on tuning these kernel parameters but would say don’t raise that value above 50% of your memory. If you need more than that for the PostgreSQL configuration you want you need to buy more RAM.

    A very important piece to a basic PostgreSQL configuration is to ensure you are vacuuming and analyzing your database regularly to clear out dead tuples, keep the planner aware of the lastest changes, and keep your queries fast. This page covers the details if you want more technical bits about why this is important.

    On top of running autovacuum at an appropriately aggressive level I would recommend setting up a job on your server that runs every night during off-peak times which does a VACUUM ANALYZE on the entire DB. Your DB can still be used while this is happening though performance will take a bit of a hit. You can use the command line tool vacuumdb provided with PostgreSQL to easily include this in a shell script. It’ll look something like this:

    /usr/bin/vacuumdb -v --analyze [Database Name]

    Lastly, you absolutely need to do backups. Raid is not a backup. Mirroring your data even across machines or disk arrays is also NOT a backup alone (what happens if you accidentally delete a table and that gets mirrored right on over to the backup?). Your plan should at minimum include using pg_dump to do a regular backup of everything and store it in at least 2 locations and at best also include using PostgreSQL’s Point In Time Recovery (PITR) features to keep a warm standby. Our setup in production includes PITR to an almost equivalent server that we can fail over to instantly on failure as well as a pg_dump that is saved to 2 servers in the production cluster and also uploaded to S3 every night (after breaking it into sub-5GB chunks, the S3 maximum files size, with the split command). PostgreSQL has never failed on me, never shutdown because of an error, and never corrupted any data but you can never be too safe with your production database. There are occasionally very public stories of companies basically dying when their database fails and their backups are weak or nonexistant.

    Conclusion

    PostgreSQL is a sophisticated piece of software. It has some utterly amazing engineering behind it but also a lot of complexity. Hopefully this post can help someone get started that is having trouble finding concrete information. If you are setting up a new PostgreSQL server you want to make good hardware choices, install the very latest version your package manager allows, spend some time tuning the configuration, and setup good maintenance and backup scripts that run at regular intervals.

    At some point I will follow up this post with two more - one on query and index tuning (the awesomely fun part in my opinion) and another on setting up and using PITR.

Monitoring PostgreSQL TPS with Zabbix

2009 February 21
by Joe

I have become a big fan of Zabbix. It is a full-featured monitoring and notification system for keeping an eye on servers, gear, websites, whatever. The interface can be completely confusing but it is incredibly powerful and I have found I prefer it over the other systems I have tried (Nagios, Zenoss, Cacti). I hope to do an in depth post on Zabbix eventually. I have it doing a ton of stuff including tracking CPU utilization and disk IO (awesome IO monitoring thanks to this page) on all servers, detecting outtages and automatically fixing them, monitoring and charting the response time of our websites, all along with notifications via email and text if things ever get out of whack.

This post assumes you already know about Zabbix, already have Zabbix server setup, already have a Zabbix agent on your DB server, and want to monitor the Transactions Per Second of a PostgreSQL database. If you are looking for more general help the Zabbix forums and manual are pretty decent.

Agent Setup

On your PostgreSQL server fire up psql. Now run select * from pg_stat_database. Check out that output. Among other things there is a “xact_commit” column that contains a count of all transactions committed for each database. That is the number we will be using. If you are running PostgreSQL 8.3 or later you will also have columns for tuples_fetched, tuples_inserted, etc which could be pretty interesting to also monitor and chart.

Now that you see what is going to be used for the raw data, edit your zabbix-agentd.conf file. Mine is at /etc/zabbix/zabbix_agentd.conf. Go down to the “USER-DEFINED MONITORED PARAMETERS” section at the bottom.

Add a new UserParameter that looks like this:

UserParameter=custom.postgresql.tps,psql [dbname] -A -t -U [dbuser] -c “select sum(xact_commit) from pg_stat_database”

The format on the right is item key name, command. You can change the item name if you want, and will want to substitute in a [dbname] and [dbuser] that works for your DB.

Now restart your Zabbix agent. On Arch Linux this would be /etc/rc.d/zabbix-agentd restart.

Server Setup

Now for the Zabbix server side login to your Zabbix administrative web interface. Go to the “Configuration -> Items” section, select your DB server for the host, and hit “Create Item.” Here is a screenshot of my settings here.

PostgreSQL TPS Item

Head over to the “Configuration -> Graphs” section now, select your DB server for the host, and hit “Create Graph.” Dimensions and name can be whatever. Add the item you created above to the graph and save it.

Conclusion

That should be it. You can now throw that graph on a screen for monitoring, setup a trigger based on the item (perhaps get a notification if TPS drops very low or gets very high), etc. As suggested earlier it would probably also be very interesting to monitor and plot those “tup_X” columns in pg_stat_database as well to get a constant idea of what your DB is doing most of.

I won’t give our specific numbers, but will say I was mighty impressed with our TPS especially considering our DB server isn’t working too hard on the CPU/disk end of things. Suspect there is a point where the percentage of stuff in RAM will decrease as the DB size increases and then the disk will start working more. But, thanks to Zabbix I will be completely aware of if/when that happens because I have a screen hanging on my wall with all the relevant details as the server runs.

Introduction to psql

2009 January 16
by Joe

PostgreSQL is an awesome piece of software. Superior in almost every way to MySQL and many of the commercial database options out there. Though I likely am abnormal in this - it and psql (its command-line interface) are literally some of my favorite things. To drive this home, here is the list of my top 5 commands based on my recent history at this moment.

  1. ssh (110)
  2. ls (93)
  3. cd (76)
  4. emacs (35)
  5. psql (28)

By the way, to figure out your top commands give this a try.

history | awk '{print $2}' | awk 'BEGIN {FS="|"}{print $1}' | sort | uniq -c | sort -n | tail | sort -nr

Notice that psql cracks the top 5. As already stated, it is the CLI for PostgreSQL and it is fantastic. The goal of this post is to provide a quick introduction to using psql. I won’t be talking about PostgreSQL the database - the official docs are sufficient for almost any question you may have and the mailing lists will fill in any gap. This is a very brief introduction.

The Ultimate Admin Interface

Who needs to build admin interfaces when you have psql and SQL?

In general psql is all about being efficient and using minimal keystrokes.

To view the list of available databases simply run psql -l from the command line. To then connect to a database run psql [database] (e.g. psql startups).

To connect as a specific user use psql [database] -U [user] (e.g. psql startups -U joe).

Once connected you can start writing queries and issuing commands. The basics:

  • \q - quit
  • \? - help with psql commands
  • \help - help covering all available SQL language constructs
  • \d - list of tables in current database
  • \d [table] (e.g. \d people) - list of a table’s columns and indexes

Each query you execute is wrapped in its own transaction. You can custom control the duration of a transaction however you like by manually typing BEGIN and then either COMMIT or ROLLBACK. This is generally a VERY good idea when running commands against a production server. PostgreSQL is amazing so you can do almost anything you want in a transaction and roll it back if you mess up. An example.

BEGIN;
delete from people;
drop table companies;
alter table people rename first_name to nickname;
ROLLBACK;

The rollback undoes everything. In general you should be nervous and cautious when not in a transaction (the all-the-time state in MySQL’s default) and happy and safe when in a transaction.

Getting Data Out

If you need to get data out of your database and don’t just want a raw dump using pg_dump, psql has some great commands for doing that. The ones I use all the time follow. I do this extensively for producing custom reports.

  • \a - toggles unaligned mode meaning just data-delimiter-data, no spaces or junk to make it print nicely
  • \f [character] (e.g. \f ,) - changes delimiter to the provided character
  • \o [filename] (e.g. \o my_data.txt) - specifies a file to which all output should be redirected instead of printing it to stdout

You can combine the above to get whatever data or report you need. As an example, say I need the first and last name of everyone in the database. The sequence of commands once inside psql would be something like this.

\a
\f ,
\o engineers.csv
select first_name, last_name from people;
\q

The result is a comma-delimited file named engineers.csv written to your current directory. I use this a lot, you can have it generate HTML files as well.

Even Faster

psql can do a lot from the terminal without even having to enter interactive mode. Run psql --help for a full list. The most common scenario is wanting to run a query and get the output instead of having it printed out in interactive mode. For this use the -c flag. A simple example:

psql -c "select * from startups"

That example above of generating a file could have also been achieved this way by using a few more flags:

psql -A -F , -o engineers.csv -c "select first_name, last_name from people"

Here is a handy script that combines the ‘watch’ command with psql to produce an auto-updating display of currently running queries in a database.

watch -n 2 -d 'psql wildfire -c "select waiting, procpid, query_start, current_query from pg_stat_activity where current_query != '\''<IDLE>'\'' order by query_start" | grep -v "select waiting"'

It works by pulling the relevant information from the pg_stat_activity system table, removing lines that are just idle connections (not idle in transaction which is different and generally bad) and then watches that output, updating every 2 seconds.

Conclusion

That was a super quick introduction and only covered a few bits but honestly the above is the bulk of what I need on a daily basis (along with a good understanding of SQL and PostgreSQL’s planner). As with the PostgreSQL documenation the psql and SQL documentation accessible from within psql is incredibly good and thorough so feel free to search around in there if you need to do something specific.

Startups and The Cloud

2009 January 1
tags:
by Joe

The “cloud” has really taken off this past year. Despite a few outages it is obvious that the players involved are very serious about making it work and work well. It started as a place to store things but there are now companies putting their database, their messaging, and everything else in the cloud. Part of me is bummed about the future because buying and playing with servers and hardware is a lot of fun, but I can recognize a shift when I see it. The cloud seems like a perfect companion to startups, allowing a company to grow linearly as customers and business warrant instead of having to drop painfully large clumps of capital on gear (leasing helps some). It also removes uncertainty about being able to handle future load - predicting future load as a startup with no history is really tough. You either buy hardware too early (big $$$ wasted) or don’t have enough and your services go down. With the cloud yet another obstacle to bootstrapping is removed.

Things Have Changed Fast

The wild part to me is the speed with which this has happened. I helped start my current company ~3 years ago (wow that feels like a long time). We moved into our first office about 6 months after starting. At first it was two of us on our personal laptops working on the borrowed desks of a local consultancy that was helping build version 1 of the software. When we moved into our own office, I built a machine from parts and setup Exchange and that was the start of our “corporate IT”. Production was served up by leased servers colocated at a hosting facility through a reseller. Putting things in the cloud wasn’t an option at all.

We’ve changed the infrastructure a lot (want to blog more about the things tried, what has worked best, and the current setup - possibly at the company blog if not here) but it has always been services installed and hosted on servers I setup. We have a good bit of hardware now so I don’t see us going 100% AWS or similar, but am starting to feel like exploring more what can be pushed off our gear is a priority. We use google docs a lot as a company but could be doing a lot more.

Tomorrow’s Startups

My main point in this ramble is simply to state that if I was starting a new company today, it would make complete sense to put almost everything in the cloud. My paranoid personality would probably cause a server to be in place for a fast, local file share and backup for data/files in the cloud but most storage, production, qa, source control, documents, corporate email, etc could be pushed off to whatever cloud players makes the most sense (weighing more management/oversight/support with cost and staff skills/time). It made sense for IT items some time ago but now makes just as must sense for products and production.

It makes me want to hack on something on the side that goes out of its way to leverage the cloud just so I can force myself to get more familiar with it. Also, Cloud Camp Atlanta sounds cool.