Postgres Migration, Bulk Data Loading, Updates vs. Upgrades, UUID Benchmarks

[00:00:00] In this episode of Scaling Postgres, we talk about a postgres migration bulk data loading, updates versus upgrades and UUID benchmarks. I'm Kristen Jameson and this is Scaling Postgres, episode 144. [00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Seamless MongoDB to PostgreSQL migration. This is from the Coinbase [email protected] and basically a lot of their major transactions that they track within Coinbase, they migrated from MongoDB to postgres. And it looked like this happened in 2019. And they go through the process that they used to do it. Now they wanted to do it very accurately because it involves transactions, but they also wanted to do a lot of iteration. So they made it a very repeatable process for what they were testing out. And this is the general process that they use. So they had their legacy system and the clients against it. So then what they chose to do is continue to read and write from this legacy database, but then do parallel writes to the new postgres system. And then during this process that they are parallel writing between essentially phase two and phase three, they backfilled the data. In the phase two stage, they were only reading from the legacy, but they were writing to both. In this stage, they were still writing to both, but reading from the new database system. And then finally, phase four indicates the cutover process. Now they described their repeatable backfill process. Basically, they were taking MongoDB, sending files to S three, then importing them into the postgres RDS system from there. And because it took a while to do the backload, they actually did it into temporary tables and recreated all of the indexes. And then at the cutover process, they used a transaction to essentially rename the tables. Now they also mentioned they did use triggers that tracked the insert, updates and deletes to keep the data in sync between the live and the backfill tables. So that's something else that they did. And they had this list of the process they went through to do the backfill and transfer process. And they said the automated process to do this data transfer took about four to 6 hours. But this was done without any downtime of their system at all, the process that they worked out here. So if you're interested in checking out how they did this process, definitely check out this blog post because I found it very insightful. [00:02:37] The next piece of content is bulk loading into PostgreSQL options and comparison. This is from Higo CA, and they compared a couple of different processes of bulk loading data. They described the system they're using here, and they used a CSV file that had 5 million rows, 14 columns, that was about 624 megabytes in size. Then they show you essentially the schema here. So the first thing they used, of course, was the copy command. They also tried using the client copy command that's a part of a psql. They tried using a file foreign data wrapper to do it as well as a PG bulk load tool. Now, I haven't used this tool, but it says it skips the shared buffers and wall logging process to be able to upload data fast and here are essentially the results. This is the method used here, the copy client copy foreign data wrapper, the bulk load and the top of the table is a standard table without an index. Now, as you can see here, the bulk load command wins out in every instance. So bypassing the shared buffers and the wall is a very efficient way of loading data if you need the most efficiency. But I would still probably resort to just using the general copy command myself, particularly if you care about your data and you want it to be able to handle any sort of failures. Now, they also tried this with an unlock table and interestingly it shaved a little bit off of the times from the copy commands and the foreign data wrapper, but it actually was a little bit longer on the bulk load so it didn't make much of a difference. And then he compared what it's like to load data with even just one simple index. I believe it was on the country column and you can see suddenly it's over five times slower and even in an unlocked with one index. Still, it wasn't an appreciable difference, but the well known process of removing indexes from a table, loading the data and then applying the index, that was around 25 seconds, which was double the time of without an index. But it's less than half the time of leaving the index in place. So this well accepted standard of dropping indexes, loading the data and then reapplying the indexes is still the fastest way even from this analysis. But it looks like that if you want the most performant way to load data, it is this bulk load that bypasses the shared buffers and the wall logging. So, definitely an interesting blog post and if you want to find more about the numbers and the process used, definitely check out this post from Higo CA. [00:05:12] The next piece of content is upgrading and updating postgresq wall. So they're making a distinction between upgrades and updates. So an update they're classifying as moving to the next point release. And to do that, all you have to do is upgrade your binaries. You don't need to alter the data files at all, so you don't need to do a literal upgrade process. What they're calling an upgrade is where you're moving to a major version release. So you're going from say 9.5 to 9.6 to ten 1112 or 13. Then you need to go through an upgrade process. The simplest ones just do a PG dump and then reload the data. But on a large database that takes a long time. The next technique you can use is a PG upgrade and that basically copies the data, as they say here, on a binary level. But if you use the hard link process that you're able to keep all the data files in place and the upgrades happen very fast, like a terabyte multi terabyte database can be done on the order of seconds. Now, one thing they did mention here is they said, quote, what is important to note here is that PG upgrade is never destructive. If things go wrong, you can always delete the new data directory and start from scratch. But there is a caveat with the Link option because if you're using Link option, you essentially don't have a copy of the data files. And what the postgres documentation says is if you use the link mode for the PG upgrade, the upgrade will be much faster, use less disk space, but you will not be able to access your old cluster once you start the new cluster after the upgrade. So things definitely change and you can't go back. Generally I do use the link mode when I do PG upgrades, but I always have a replica in place. So essentially I'm upgrading that replica or I upgrade the master and keep the replica offline as a backup to be able to fall back if I need to. And then the rest of the post gives an example of going through the process of doing an upgrade using PG upgrade. So if you're interested in that, you can check out this post from Cybertechgresql.com. [00:07:15] The next piece of content is benchmark version four UUID generation in Postgres. And they're describing two different functions that exist for generating UUIDs in postgres. One is the UUID generate version four which is available via the UUID OSSP extension, but as of version 13, there is a gen random UUID function that's built in. So we actually did a test of this actually using version twelve of Postgres because he actually used the PG crypto extension and he determined how fast you could generate 10,000 UUIDs across a set of different hardware and different platforms. And what he found is much better performance for the gen random UUID function for generating UUIDs. So definitely you're going to want to reach for that, and particularly since it's built into postgres 13, that makes sense. But in particular the Windows server was awfully slow with the UUID generate version four function. So if you run Postgres on Windows, you definitely want to be using this gen random UUID. So if you want to learn more about this, feel free to check out this blog post from sushan info. The next piece of content is using Plpg SQL to calculate new postgres columns. This is from Crunchydata.com and they're talking about analyzing information from a data science perspective and that generally they want to calculate zscores right next to the data that they're analyzing. And to do this, he actually generated a plpgsql function to generate those extra columns containing those zscores. So he has a link to the full function on GitHub right here, but then he walks through each portion of it. So he has a function where he determines, okay, what schema we're going to be changing, what table name, what prefix to append to the beginning of the column as well as an array of columns that he wants to do the calculations for, as well as an indicator of what the primary key is, to be able to update the data easily. So the function does a for each loop through those array of column names, it generates what the new column name should be, alters the table to add that new column as a numeric and then populates that column using this function that contains two CTEs to do the calculations for the zscores. [00:09:36] So if you're interested in using functions to have this ability to assign zscores and do calculations with particular tables, definitely check out this blog post from Crunchydata.com. [00:09:48] The next piece of content is what's the special for logical level of PostgreSQL? Wall. So this post is asking the question of when you have a wall level of replica and then move it to wall level of logical. It says it records more information. Well, what is that? More information? That's what this blog post covers. Now predominantly it talks about differences with what it records, with whether the wall record is a full page image or not. And full page images have to happen after a checkpoint to make sure it has all the necessary information to do a recovery. So if you're interested in determining the differences, you can check out this post from Heigo CA. [00:10:30] The next piece of content is waiting for PostGIS three one, GEOS three Nine. And this is from Crunchydata.com. And it's talking about PostGIS and how it has additional libraries it works with. And the one that they're mentioning here is GEOS that have some enhancements that have been added specifically to the overlay engine function that allows for greater capabilities and more performance. And they're mentioning two times faster than the original. So if you use PostGIS and specifically GEOS, you may be interested in this content. [00:11:04] The next piece of content is hands on with OSM two S, PG SQL's new Flex output. So this is talking about the OpenStreetMap data and being able to get it into postgres. And apparently it has a new output that can serve as an input for loading into postgres and it has a lot more flexibility and you can use JSON B fields. So if you work with the OpenStreetMap data, perhaps you want to check out this post from Rustprooflabs.com. [00:11:34] The last piece of content is the PostgreSQL Person of the Week is Carol Arnold. So if you're interested in Carol and her contributions to postgres, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show. Notes be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube or itunes. Thanks. [00:12:04] Our channel our.

Show Notes

Episode Transcript

Other Episodes

Episode 60

Pluggable Storage, Developer Gotchas, Cursors, PG12 | Scaling Postgres 60

Episode 272

Postgres 16 Beta 2, The Rise of Vectors, FDW Performance, Unused Indexes | Scaling Postgres 272

Episode 118

Safety Systems, Failover Slots, Transaction ID Complications, Repartitioning | Scaling Postgres 118

Postgres Migration, Bulk Data Loading, Updates vs. Upgrades, UUID Benchmarks | Scaling Postgres 144