PostgreSQL Setup Basics
There is a seemingly endless amount of information you can learn about PostgreSQL usage, administration, and tuning. The amount of information out there can be overwhelming but just learning about a small piece of it can allow you to get dramatically better performance. This post is an effort to provide something you can read quickly and get some benefit from. This is purely my opinion, you should spend some time reading the excellent PostgreSQL documentation, reviewing proper tuning guides, and reading the PostgreSQL performance list for more information. I’ve tried in this post to keep things at a high level and omitted a lot of details to try and make it shorter and easier to read.
On all of the below the PostgreSQL mailing lists are invaluable. You can ask questions about your hardware, queries, performance, whatever and get utterly helpful and complete information from incredibly knowledgeable people many of whom are the developers behind PostgreSQL.
PostgreSQL scales up beautifully as you feed it increasingly powerful hardware and can handle enormous concurrent load. Here is how hardware components rank in importance.
- Disk IO is king. Especially if your database does not fit in memory. You want a lot of spindles (individual disks), a good raid controller with a battery-backed cache, and raid 10 is recommended almost universally. Fancy CPUs and lots of RAM are a complete waste of money if your disk setup is not up to snuff. Faster individual disks are obviously better too.
- Memory is next important. Almost as important as disk IO if your database fits completely in memory but that isn’t a condition that sticks around forever or that you can normally rely on. The more memory the better. Several of the configuration parameters in postgresql.conf are direct percentages of total system memory.
- CPU is still important, especially if your database is doing a lot of number crunching and reporting/complex queries, but isn’t going to be your bottleneck in most cases. 8 cores is easy to achieve now, just buy whichever Xeon CPU hits the price/performance sweet spot. Don’t pay huge for bleeding edge GHz.
Mainstream vendors should be sufficient for all but massive databases. Our database holds hundreds of gigs of data and does thousands of transactions a second and it runs on plain Dell equipment (2950 linked to a MD1000, 20 disks total).
Obviously if you are looking to purchase a server for running PostgreSQL do more research than this post before pulling the trigger. Sometimes you can even get a vendor to let you borrow a unit for a short period of time to benchmark on if it is sufficiently expensive.
I strongly recommend running PostgreSQL on its own, dedicated server unless you are expecting trivial load. When you setup your partitions during the OS installation you should be mindful of your disk spindles. You want to have disks dedicated to doing nothing except database writing and reading. Here’s a list of a few things to think about in terms of IO usage:
- Your operating system.
- The actual PostgreSQL data that is written/read.
- PostgreSQL logging (lots of log types – transaction, standard, write-ahead).
Considering the above, try to have them each on separate disk arrays. If that isn’t possible, try to at least separate the OS from PostgreSQL. The standard logs for PostgreSQL can be tuned and configured in postgresql.conf. The transaction logs cannot but they get written to
$PGROOT/pg_xlog. You can use a symlink to point pg_xlog wherever you want, just be sure to shut PostgreSQL down before making that change and preserve any existing files in there in the move.
Lets say you buy a six disk server to run your database on. You could setup two in a raid 1 for your OS and the remaining four in a raid 10 for all your PostgreSQL stuff. More disks gives more flexibility and more performance but we ran on that smaller configuration for awhile and it worked pretty well. Consider going with 2.5″ drives as you can get more disks in the same server chassis.
A final note on this, PostgreSQL supports the notion of tablespaces. Tablespaces can be mapped to separate physical disk locations allowing you to put certain tables or indexes on different physical disk arrays. This can be useful for splitting up the IO load of the server or for doing things like moving heavily hit indexes or tables to a faster more expensive set of disks (that perhaps would have been too expensive to make large enough for the entire DB). We do this. We have a bigger raid 10 of slower SATA drives for most of our data and then a smaller raid 10 of 15K SAS drives for a few of the hard hit items.
In Arch Linux installing PostgreSQL (via
pacman -Sy postgresql) creates an
/etc/rc.d/postgresql script for starting and stopping the database. The first time you “start” that script it calls
initdb if the configured $PGROOT does not exist. Subsequent times it simply does a
pg_ctl start. You will get something different but similar (probably /etc/init.d/postgresql) in other distributions.
Once your database as been initialized all of the important pieces of your PostgreSQL installation (aside from the command line tools) will be underneath the $PGROOT. This includes the actual data stored in your DB, your postgresql.conf file, directories for the various PostgreSQL logs, and a few other configuration files.
Several command line tools get put in
/usr/bin/. Here are a few of them and their purpose. The PostgreSQL documentation has in depth instructions and information on each of these as well as the ones I am omitting.
pg_ctl– This is what gets used to actually start and stop PostgreSQL. It is often not called directly because an init script uses it internally.
initdb– Is used one time to initialize the $PGROOT for your PostgreSQL install. Can be directed to initialize in a specific location and you can control the locale used (I recommend going with UTF-8) with its various parameters. This one also generally won’t be called directly but by an init script in your Linux distribution. If you need to call it yourself the most common command (for me) is
/usr/bin/initdb -D [MyPGROOT] -E UTF8 --locale en_US.UTF-8.
pg_dump– Tool for dumping a specific database in your PostgreSQL install. This one dumps only a single database and no global/cluster settings like tablespaces or users.
pg_dumpall– Tool for dumping your entire PostgreSQL install to a file. Includes users, tablespaces, permissions, everything. Is best choice if dumping for the purpose of moving an entire install to a new machine or to a new version of PostgreSQL.
pg_restore– Tool for restoring a file produced by
vacuumdb– Tool for doing a vacuum (and optionally an analyze with the –analyze parameter) from the command line.
The tuning guide linked at the top of this post is an awesome walkthrough of the various parameters so there is no point in mentioning anything about that here. Go read that.
Some settings can be changed and reloaded without restarting your DB while others cannot. The postgresql.conf file itself notes which ones require a restart. For the ones that do not require a restart you can make the changes and then use
pg_ctl to reload the config like this:
/usr/bin/pg_ctl reload -D /path/to/$PGROOT
If it does require restart you will need to completely bounce PostgreSQL. In Arch Linux this is simply
/etc/rc.d/postgresql restart and you will do something similar in other distributions.
Also, the combination of your shared_buffers setting and your max_connections setting may require you to make a change to the shmmax setting of your server’s kernel params. Basically it is the largest chunk of shared memory an application can grab at once. PostgreSQL will print out a clear error at startup if this parameter is set to an insufficient level but doesn’t give any pointer on how to change it.
To change it do a
cat /proc/sys/kernel/shmmax to check the current value. The value is in bytes. If it is too small you can edit
/etc/sysctl.conf and add a line at the bottom containing
kernel.shmmax = myNewValue. Then run
sysctl -p and you should be good. I am no expert on tuning these kernel parameters but would say don’t raise that value above 50% of your memory. If you need more than that for the PostgreSQL configuration you want you need to buy more RAM.
A very important piece to a basic PostgreSQL configuration is to ensure you are vacuuming and analyzing your database regularly to clear out dead tuples, keep the planner aware of the lastest changes, and keep your queries fast. This page covers the details if you want more technical bits about why this is important.
On top of running autovacuum at an appropriately aggressive level I would recommend setting up a job on your server that runs every night during off-peak times which does a
VACUUM ANALYZE on the entire DB. Your DB can still be used while this is happening though performance will take a bit of a hit. You can use the command line tool
vacuumdb provided with PostgreSQL to easily include this in a shell script. It’ll look something like this:
/usr/bin/vacuumdb -v --analyze [Database Name]
Lastly, you absolutely need to do backups. Raid is not a backup. Mirroring your data even across machines or disk arrays is also NOT a backup alone (what happens if you accidentally delete a table and that gets mirrored right on over to the backup?). Your plan should at minimum include using pg_dump to do a regular backup of everything and store it in at least 2 locations and at best also include using PostgreSQL’s Point In Time Recovery (PITR) features to keep a warm standby. Our setup in production includes PITR to an almost equivalent server that we can fail over to instantly on failure as well as a pg_dump that is saved to 2 servers in the production cluster and also uploaded to S3 every night (after breaking it into sub-5GB chunks, the S3 maximum files size, with the split command). PostgreSQL has never failed on me, never shutdown because of an error, and never corrupted any data but you can never be too safe with your production database. There are occasionally very public stories of companies basically dying when their database fails and their backups are weak or nonexistant.
PostgreSQL is a sophisticated piece of software. It has some utterly amazing engineering behind it but also a lot of complexity. Hopefully this post can help someone get started that is having trouble finding concrete information. If you are setting up a new PostgreSQL server you want to make good hardware choices, install the very latest version your package manager allows, spend some time tuning the configuration, and setup good maintenance and backup scripts that run at regular intervals.
At some point I will follow up this post with two more – one on query and index tuning (the awesomely fun part in my opinion) and another on setting up and using PITR.