Here’s a dozen tips for working with a PostgreSQL database. It is a sophisticated and powerful piece of software and just knowing a few rules of thumb before diving in can be a huge help. If you want more detail read the amazing documention. My list of tips was very long so I just chopped off a dozen for this post.
#1: Don’t do sequential scans – use indexes
Do SELECTs against indexes. Sequential scans will devastate your IO and in most cases should be avoided. I read on a bogus guide somewhere that needing indexes was a sign of a bad database schema. That is complete BS.
The only time indexes can hurt is when doing bulk inserts. I did a post about dropping/recreating to temporarily boost bulk insert performance awhile back.
Most of the time it doesn’t matter though. Throw down as many indexes as you need.
#2: Index all foreign keys
There are certain rare cases where this doesn’t help, but as your database grows in size this becomes increasingly important. Say you have a “people” table. You then have an “emails” table with a person_id column that is a FK to the people table. As an example say you have 10 million rows in the emails table.
Now delete a row from that people table. PostgreSQL must scan the emails table and ensure that no email rows reference the person row you just deleted. If that person_id column on emails is not indexed you just kicked off a sequential scan of 10 million rows with your deletion attempt. Index that column and the pain goes away. My general rule of thumb is to index every single FK. You can always drop the indexes later if they aren’t needed.
Here’s a handy query for discovering all the FKs pointed at a table in a database. You can then shoot through the results and verify that each table has indexes for those FKs.
select t.constraint_name, t.table_name, t.constraint_type,
c.table_name, c.column_name
from information_schema.table_constraints t,
information_schema.constraint_column_usage c
where t.constraint_name = c.constraint_name
and t.constraint_type = 'FOREIGN KEY'
and c.table_name = 'MY_TABLE_NAME';
#3: Don’t be afraid to selectively de-normalize
It is only a matter of time before you realize the normal forms you learned about in school can seriously hurt your database performance as size gets huge. De-normalize (usually meaning duplicate a value in two tables) when it provides a big performance increase. You must be mindful of the maintenance concerns with de-normalization. The maintenance can be done in code, with triggers or similar in the DB, or sometimes the data is static enough that it isn’t a huge concern.
Edit: I hedged this advice from the original post (originally was a blanket “de-normalize all the time” that could burn someone new to databases). You should know the normal forms and understand them. My point was that you shouldn’t feel that wavering from them with de-normalization makes your database a bad one – it is simply a reality of nontrivial production databases. But, if you don’t understand them, and/or don’t understand when and why de-normalization can help blindly ignoring the normal forms is only going to get you into more trouble. So learn them and understand them but then don’t be afraid to selectively de-normalize where it can give you big boosts of performance. A common scenario that is often very helpful is de-normalizing to eliminate a join or multiple reads.
#4: Avoid joins on huge tables
As your tables get bigger and bigger (many millions of rows) joins cannot be done in memory and spill over to disk. Once that happens your performance is gone. For smaller tables it doesn’t matter but as tables get bigger avoid joining when you can. Try using unions, subselects, or denormalization instead. If you have two big tables that are 1-to-1 that you are joining a lot combine them into one wide table.
#5: Don’t do unanchored text searches or full text searching
That is don’t try something like this with wildcards on both ends:
select * from people where first_name like '%joe%'
The reason is that no index can be used so PostgreSQL will be doing a sequential scan no matter what. If you truly need full text searching use Lucene or PostgreSQL 8.3+ full text searching capabilities instead of standard indexes.
If you are in a pinch and can get by with the left side being anchored you can create a special index using varchar_pattern_ops like this:
create index idx_people_first_name on people (first_name varchar_pattern_ops);
That index can be used by this query (no wildcard on left side):
select * from people where first_name like 'joe%';
But be aware that same index cannot be used by a query using the standard “=” operator so you have to create two indexes to get both (one with varchar_pattern_ops, one without). Yes this is messy and its why you should use Lucene, PostgreSQL 8.3+ full text searching, or something similar to do text searching.
#6: Use the COPY command for bulk data loading
It is orders of magnitude faster. Check a previous post of mine if you want numbers.
#7: Use multi-column indexes
For common queries with compound WHERE clauses create a multi-column index for maximum performance. An example:
create index idx_people_full_name on people (first_name, last_name);
Now you can do queries like this and hit that index:
select * from people where first_name = 'joe' and last_name= 'smith';
Additionally, any WHERE clause using just first_name can take full advantage of this index (because first_name is the first column) so don’t waste disk space creating redundant single column indexes if you are already covered by a multi-column.
Finally, you can align a multi-column index with a common ORDER BY clause to make that query instant. Using the above index a query against people ending in order by first_name, last_name would be super fast.
As a final note PostgreSQL can combine multiple single-column indexes so multi-column indexes on every combination are not necessary. But, a single multi-column is going to be faster so they can be helpful for queries you run a lot. You can read more about combining of multiple indexes here.
#8: Don’t bother indexing evenly distributed booleans or enum columns
In PostgreSQL indexes do not store data, they are instead pointers to the data. This means that after scanning the index a random read must be done for each matching record to get the data. Random IO is a lot more expensive than sequential IO (though SSDs are changing that) so if the PostgreSQL planner (it does sampling of your data in the background) knows you have a very evenly distributed column (a boolean split 50/50 for instance) it won’t use the index at all. It will instead make the assumption that a straight up sequential scan will be faster than index scanning for half the rows and having to do a ton of random IO. Indexes on these types of columns are a waste of disk space and you are slowing down your inserts for no benefit.
#9: Use EXPLAIN ANALYZE to benchmark and compare queries
Using EXPLAIN on its own will give you the planner’s guess at how it will run your query. This is pretty useful if you have a very huge query that is going to take a long time and you want to tune it without having to actually run it after each tweak. The complete EXPLAIN ANALYZE will run the query, show the plan predicted, show what actually happened, and how long it took.
Be sure to ANALYZE after significant table modifications so that the planner can do its sampling and adjust its plans to account for any index changes.
#10: Use expression indexes – they are awesome
In PostgreSQL you can index an expression. Here is a simple example:
create index idx_people_lower_first_name on people (lower(first_name));
Now you can run queries like this and use the index – no need to create a separate column with lowercase values:
select * from people where lower(first_name) = 'joe';
You can similarly index combinations of your data-containing columns:
create index idx_people_lower_full_name on people (lower(coalesce(first_name, '')) || lower(coalesce(last_name, '')));
#11: Indexes can have WHERE clauses
You can create conditional indexes that only apply for certain rows. A useful scenario is creating a unique constraint on a column or set of columns only where certain conditions apply:
create unique index idx_people_person_id on people (person_id) where deleted = false;
That’s a simple example that feels a bit cleaner than changing the ids of deleted records to avoid conflicting with a unique constraint. Anything can be placed in that WHERE clause, they aren’t limited in any way.
#12: Don’t use the database if it isn’t a good fit
Databases only scale so far alone. At some point you are going to have to explore memcached, terracotta, or something similar to continue to scale. The NoSQL (over)excitement going on right now offers alternatives as well for certain scenarios. The main point is that at some point your database will just be one component in a set of services and technologies that provide the persistence layer for your application.