PostgreSQL Tips and Tricks
Here’s a dozen tips for working with a PostgreSQL database. It is a sophisticated and powerful piece of software and just knowing a few rules of thumb before diving in can be a huge help. If you want more detail read the amazing documention. My list of tips was very long so I just chopped off a dozen for this post.
#1: Don’t do sequential scans – use indexes
Do SELECTs against indexes. Sequential scans will devastate your IO and in most cases should be avoided. I read on a bogus guide somewhere that needing indexes was a sign of a bad database schema. That is complete BS.
The only time indexes can hurt is when doing bulk inserts. I did a post about dropping/recreating to temporarily boost bulk insert performance awhile back.
Most of the time it doesn’t matter though. Throw down as many indexes as you need.
#2: Index all foreign keys
There are certain rare cases where this doesn’t help, but as your database grows in size this becomes increasingly important. Say you have a “people” table. You then have an “emails” table with a person_id column that is a FK to the people table. As an example say you have 10 million rows in the emails table.
Now delete a row from that people table. PostgreSQL must scan the emails table and ensure that no email rows reference the person row you just deleted. If that person_id column on emails is not indexed you just kicked off a sequential scan of 10 million rows with your deletion attempt. Index that column and the pain goes away. My general rule of thumb is to index every single FK. You can always drop the indexes later if they aren’t needed.
Here’s a handy query for discovering all the FKs pointed at a table in a database. You can then shoot through the results and verify that each table has indexes for those FKs.
select t.constraint_name, t.table_name, t.constraint_type, c.table_name, c.column_name from information_schema.table_constraints t, information_schema.constraint_column_usage c where t.constraint_name = c.constraint_name and t.constraint_type = 'FOREIGN KEY' and c.table_name = 'MY_TABLE_NAME';
#3: Don’t be afraid to selectively de-normalize
It is only a matter of time before you realize the normal forms you learned about in school can seriously hurt your database performance as size gets huge. De-normalize (usually meaning duplicate a value in two tables) when it provides a big performance increase. You must be mindful of the maintenance concerns with de-normalization. The maintenance can be done in code, with triggers or similar in the DB, or sometimes the data is static enough that it isn’t a huge concern.
Edit: I hedged this advice from the original post (originally was a blanket “de-normalize all the time” that could burn someone new to databases). You should know the normal forms and understand them. My point was that you shouldn’t feel that wavering from them with de-normalization makes your database a bad one – it is simply a reality of nontrivial production databases. But, if you don’t understand them, and/or don’t understand when and why de-normalization can help blindly ignoring the normal forms is only going to get you into more trouble. So learn them and understand them but then don’t be afraid to selectively de-normalize where it can give you big boosts of performance. A common scenario that is often very helpful is de-normalizing to eliminate a join or multiple reads.
#4: Avoid joins on huge tables
As your tables get bigger and bigger (many millions of rows) joins cannot be done in memory and spill over to disk. Once that happens your performance is gone. For smaller tables it doesn’t matter but as tables get bigger avoid joining when you can. Try using unions, subselects, or denormalization instead. If you have two big tables that are 1-to-1 that you are joining a lot combine them into one wide table.
#5: Don’t do unanchored text searches or full text searching
That is don’t try something like this with wildcards on both ends:
select * from people where first_name like '%joe%'
The reason is that no index can be used so PostgreSQL will be doing a sequential scan no matter what. If you truly need full text searching use Lucene or PostgreSQL 8.3+ full text searching capabilities instead of standard indexes.
If you are in a pinch and can get by with the left side being anchored you can create a special index using varchar_pattern_ops like this:
create index idx_people_first_name on people (first_name varchar_pattern_ops);
That index can be used by this query (no wildcard on left side):
select * from people where first_name like 'joe%';
But be aware that same index cannot be used by a query using the standard “=” operator so you have to create two indexes to get both (one with varchar_pattern_ops, one without). Yes this is messy and its why you should use Lucene, PostgreSQL 8.3+ full text searching, or something similar to do text searching.
#6: Use the COPY command for bulk data loading
It is orders of magnitude faster. Check a previous post of mine if you want numbers.
#7: Use multi-column indexes
For common queries with compound WHERE clauses create a multi-column index for maximum performance. An example:
create index idx_people_full_name on people (first_name, last_name);
Now you can do queries like this and hit that index:
select * from people where first_name = 'joe' and last_name= 'smith';
Additionally, any WHERE clause using just first_name can take full advantage of this index (because first_name is the first column) so don’t waste disk space creating redundant single column indexes if you are already covered by a multi-column.
Finally, you can align a multi-column index with a common ORDER BY clause to make that query instant. Using the above index a query against people ending in order by first_name, last_name would be super fast.
As a final note PostgreSQL can combine multiple single-column indexes so multi-column indexes on every combination are not necessary. But, a single multi-column is going to be faster so they can be helpful for queries you run a lot. You can read more about combining of multiple indexes here.
#8: Don’t bother indexing evenly distributed booleans or enum columns
In PostgreSQL indexes do not store data, they are instead pointers to the data. This means that after scanning the index a random read must be done for each matching record to get the data. Random IO is a lot more expensive than sequential IO (though SSDs are changing that) so if the PostgreSQL planner (it does sampling of your data in the background) knows you have a very evenly distributed column (a boolean split 50/50 for instance) it won’t use the index at all. It will instead make the assumption that a straight up sequential scan will be faster than index scanning for half the rows and having to do a ton of random IO. Indexes on these types of columns are a waste of disk space and you are slowing down your inserts for no benefit.
#9: Use EXPLAIN ANALYZE to benchmark and compare queries
Using EXPLAIN on its own will give you the planner’s guess at how it will run your query. This is pretty useful if you have a very huge query that is going to take a long time and you want to tune it without having to actually run it after each tweak. The complete EXPLAIN ANALYZE will run the query, show the plan predicted, show what actually happened, and how long it took.
Be sure to ANALYZE after significant table modifications so that the planner can do its sampling and adjust its plans to account for any index changes.
#10: Use expression indexes – they are awesome
In PostgreSQL you can index an expression. Here is a simple example:
create index idx_people_lower_first_name on people (lower(first_name));
Now you can run queries like this and use the index – no need to create a separate column with lowercase values:
select * from people where lower(first_name) = 'joe';
You can similarly index combinations of your data-containing columns:
create index idx_people_lower_full_name on people (lower(coalesce(first_name, '')) || lower(coalesce(last_name, '')));
#11: Indexes can have WHERE clauses
You can create conditional indexes that only apply for certain rows. A useful scenario is creating a unique constraint on a column or set of columns only where certain conditions apply:
create unique index idx_people_person_id on people (person_id) where deleted = false;
That’s a simple example that feels a bit cleaner than changing the ids of deleted records to avoid conflicting with a unique constraint. Anything can be placed in that WHERE clause, they aren’t limited in any way.
#12: Don’t use the database if it isn’t a good fit
Databases only scale so far alone. At some point you are going to have to explore memcached, terracotta, or something similar to continue to scale. The NoSQL (over)excitement going on right now offers alternatives as well for certain scenarios. The main point is that at some point your database will just be one component in a set of services and technologies that provide the persistence layer for your application.
Trackbacks and Pingbacks
- PostgreSQL Tips and Tricks - DbRunas
- PostgreSQL Tips and Tricks | gtuhl: startup technology | Xmas Tips
- popurls.com // popular today
- PostgreSQL Tips and Tricks | gtuhl: startup technology « Netcrema - creme de la social news via digg + delicious + stumpleupon + reddit
- PostgreSQL Tips and Tricks | gtuhl: startup technology - Local Tech Experts
- PostreSQL Tips & Tricks
- Twitted by webstartupgroup
- PostgreSQL Tips and Tricks | gtuhl: startup technology
- PostgreSQL Tips and Tricks | gtuhl: startup technology « New Emerging Technologies for Netbook Mobile Phones Laptops Gadget and Gizmos
- Twitted by elchiva_twt
- PostgreSQL Tips and Tricks | giswhat.be
- Twitted by ryszard99
- Benjamin A. Shelton | Blog » Blog Archive » Links: August 12th
- Closer To The Ideal » Blog Archive » JOINs are evil, part II
Comments are closed.
“The only time indexes can hurt is when doing bulk inserts. I did a post about dropping/recreating to temporarily boost bulk insert performance awhile back.
Most of the time it doesn’t matter though. Throw down as many indexes as you need.”
I’ve seen plenty of times when having too many indexes mattered quite significantly. An index is effectively another copy of the data, and must be updated whenever the underlying data are updated. If having more indexes didn’t actually matter, PostgreSQL features like Heap-Only Tuples wouldn’t have any significant performance impact, which clearly isn’t the case.
If your database is extremely, massively write heavy then sure the balance may tip more towards too many indexes being worse. But that is barely ever the case. Indexes must indeed be updated every time data gets updated (insert, update, or delete) so you absolutely pay a penalty on every write. Note though that indexes are NOT a copy of the data – they do not contain data. They are pointers, and its just the structure of the index that must be updated.
I am willing to bet in 99% of the cases throwing down as many indexes as you need is the right way to go. I am not saying to just create indexes for no reason – just create one for every common SELECT. If you have a use case or project that falls in that 1% of super heavy writes would love to hear about it (honestly, love hearing about unique setups and problems).
As an example our OLTP database handles a lot of writes (millions a day) but in the overall picture writes are less than 5% of our queries so it simply doesn’t make sense to punish the other 95%. If we have to do a lot of writing for a big migration or to bring a new customer’s lists online we drop the indexes on the relevant partitions and then recreate them when done. Single record inserts caused by your application often aren’t going to feel any slower because the code-level stuff is going to hog most of the perceived time anyway.
I strongly agree with the comment that inspired the edit to denormalizing (tho I’m not sure how to see it, I only seem to see the last two comments). Any way … I am also a strong proponent of duplicated data to speed up query results.
The conceptual understanding that needs to be applied is the concept of a primary data record and all copies as being permutations of that. As long as you can at any time perform the duplication logic from primary data record to produce copies and get what you expect, then you are in safe heaven. You never update the copies except from the duplication logic.
Thanks for the comments, seems we are on the same page with respect to normalization, like your explanation of it. That edit was inspired by the comments on hacker news, this post picked up some steam there. The normalization and join advice above have together drawn the most disagreement and that was expected. The join advice too could use some moderation as many read that as “never do joins” when I really more precisely meant to communicate something like “don’t do joins between tables with lots of rows” where lots of rows means many many millions on the low end.
Hi
I’m no PostgreSQL guy, what means I don’t know about the truth of all those tips for PostgreSQL. Nevertheless, I know other database systems and most tips sound good and logical to me.
Anyway, are #3 and #4 are really true for PostgreSQL? If yes, for which kind of database? De-normalization is a technique to design selective reporting and warehousing databases, yes. De-normalization is a no-go for an OLTP database.
These “tips” are fine for a warehouse database which is filled by a small count of ETL jobs or very simple systems with only one front end application. My experiences showed me, data redundancy in a large operational database always(!) ends up in data inconsistency. To keep a de-normalized database in sync within a enterprise system is way too complex and will (not might) fail.
Again, all other tips sound fine to me and are even generally correct to all database systems.
Greets
Flo
Hey Florian, thanks for commenting. I’ve got a lot of experience with PostgreSQL, a decent amount with MySQL, and a much smaller amount with Oracle and MSSQL. I believe both points are still valid across all of those systems.
On #3 (and #4) its all about moderation and making informed decisions based on specific load. My initial wording was pretty slanted and typed in the context and bias of some work I was dealing with in our DB at the time. I hedged it a long while back. De-normalization can get you massive performance wins in a very large database by preventing joins or reducing query counts on SELECTs. The maintenance side of de-normalization can indeed be painful so you have to carefully consider each usage. Sometimes it doesn’t matter (very static data), sometimes you can have the DB handle it automatically with triggers or similar, and occasionally handling it in code isn’t too bad.
I finally did adjust the wording on #4 just now to prevent further confusion. My point is joins don’t work when your tables are HUGE. We run OLTP loads (big social networks specifically) with hundreds of thousands of users and at peak times thousands of transactions a second on a single decent Dell server. We have lots of tables and partitions that are in the 30million+ row range. Databases handle joins in memory and unless you have the RAM to set work_mem (or non-Postgres equivalent) very high they get done on disk for joining of very large tables. Once that happens they run at crippling speed. Joins are wicked fast when they fit in memory and as a general rule of thumb are the way to go (the PostgreSQL planner will even convert queries to joins under the hood when it makes sense) unless you are working with massive tables and result sets.
So I think both points are valid, but my initial language was far too broad. They make sense when you have a lot of data and are still using an RDBMS to handle it. Partitioning can help delay the usage of those techniques by keeping the tables sizes smaller but I find it hard that you could avoid both in a very large system.
Thanks again for commenting and perhaps point those fellas in that sqlservercentral forum at this response so they have some context before assuming I don’t know what I am talking about