Bulk Data Loading With PostgreSQL
Bulk data loading is usually a niche task – something that doesn’t have to be done too often but when it does it feels painful because it takes so long and if you mess something up it has to be done again. After loading dozens of 7 and 8 digit record count files into PostgreSQL databases I think it would be helpful to at least share the basic stuff that can make the process go a lot faster. The PostgreSQL documentation of course has everything you need for bulk data loading if you would rather read the official docs here but I consistently hear cases where the advice there isn’t used so I wanted to show some numbers. That link covers more techniques, all this post really adds is some numbers to back up the basics.
For this post I am working with 600k records being loaded into a table that looks like this. The data plus delimiters is about 60MB.
Table "public.people" Column | Type | Modifiers -------------+-----------------------------+----------- id | character varying(36) | not null first_name | character varying(255) | last_name | character varying(255) | middle_name | character varying(255) | client_id | character varying(36) | birth_date | timestamp without time zone | Indexes: "people_pkey" PRIMARY KEY, btree (id) "idx_people_client_id" btree (client_id) "idx_people_name_search" gin (to_tsvector('english'::regconfig, (((COALESCE(first_name, ''::character varying)::text || ' '::text) || COALESCE(middle_name, ''::character varying)::text) || ' '::text) || COALESCE(last_name, ''::character varying)::text))
A few points on this particular table and config:
- The data is a subset of a real production table made up of entries people provided on themselves so the distribution is realistic and balanced.
- The primary key id column is a UUID generated by code.
- A very basic index on the client_id column.
- A more complex expression gin index on the name fields to allow full text searching. Full text searching was added in PostgreSQL 8.3 and it’s pretty powerful. Read more about that here if interested.
- Overall it is a relatively small, simple table but not trivially so and its large enough to get decent timings from bulk load operations for comparison purposes.
- Performing tests against a clean DB, empty table, and freshly created indexes so we have a best case scenario.
So, let’s say you have that empty table and want to load 600k records in. Here is a sequence of methods going from worst to best. I am doing a VACUUM ANALZYE before and after each operation. All timings rounded up to nearest second. We are comparing INSERT and COPY.
First Setup: leave indexes and pkey in place.
INSERT: 6 minutes, 5 seconds
COPY: 3 minutes, 11 seconds.
Second Setup: drop/recreate indexes, don’t touch pkey
When timed it took 1 second to drop indexes and 13 seconds to rebuild them. I have added that overhead into these numbers.
INSERT: 3 minutes, 19 seconds
COPY: 1 minute, 5 seconds
Third Setup: drop/recreate indexes and pkey
When timed it took 1 second to drop indexes and 15 seconds to rebuild them (includes pkey). I have added that overhead into these numbers.
INSERT: 2 minutes, 49 seconds
COPY: 21 seconds
With that final setup, dropping/recreating all indexes and using the COPY command, I was able to load 600k records in 21 seconds. That includes the drop/recreate overhead. The actual COPY command only took 5 seconds.
General notes related to the tests done above:
- The COPY command is magical. It completely blows away INSERT for bulk load performance.
- Dropping and recreating indexes is far faster than leaving indexes in place and forcing thousands or millions of individual updates to them.
- Do a VACUUM ANALYZE before and after bulk loading data to be safe. If you dropped indexes be sure to recreate them before doing the ANALYZE.
- If you don’t think the speed improvement shown is significant keep in mind we are working with a narrow 6 column table, with just 3 indexes, and not very much data. The savings found by dropping and recreating indexes becomes enormous as your table height and width increases and/or your number of indexes goes up. Similarly if you are only working with thousands or low tens of thousands of rows you can just do whatever is easiest since the size is so trivial.
Some other notes worth mentioning:
- The same rules apply for updates. ALL indexes slow down updates, not just the ones on the column you are updating so if you need to update an entire table dropping and recreating the indexes will provide a massive speed increase.
- If you are doing a bulk update and using a field on the updated table for lookups you are generally better off dropping all indexes except the one on your lookup column. Say for each person in my people table I am looking up the specific person by id and setting their client_id. I would want to leave the index/pkey on the id column so those lookups were fast. You still pay a penalty at write time because the index/pkey must be updated but not having the index would mean a sequential scan is done on the entire table for each update.
- Though incredibly fast the COPY command is very picky about its input. It is an all or nothing operation and if there is even a slight issue with your data it will fail. You will want to VACUUM your table after a failed COPY to recover space. Thankfully if it does fail it prints out the exact line number and reason.
- When specifying a file for the COPY command you must use an absolute path and the file (and all directories in its path) must be accessible by the postgres user because it is the server that goes and reads the file not the client. I find it easiest to just toss my input files into /tmp.
- Consider partitioning your data to allow drop/add of indexes on certain partitions preventing an entire table from having awful read performance while data is being loaded. Partitioning is a whole separate topic but read the docs, be sure constraint_exclusion is on in your postgresql.conf, and be sure your check constraint is in the where clause of all queries if you take this route. Partitions and significant bulk data loading fit together really really well.
- Seriously read the PostgreSQL documentation on this stuff if you are bulk loading data. Aside from the additional SQL-level recommendations it covers the tuning options for your DB that can make a very significant difference.
Finally, you may find that expression index for full text searching in my people table above looks strange. I hope to do a post on indexing specifically that explains in detail. In my opinion tuning database indexes is a blast and made even greater by the amazing capabilities of PostgreSQL.