Partition Migration, Like Performance, Best Fillfactor, In-Memory Tables | Scaling Postgres 124

Episode 124 July 27, 2020 00:15:56

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how best to migrate to using partitions, LIKE & ILIKE performance, determining the best fillfactor, and the work towards in-memory tables.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/124-partition-migration-like-performance-best-fillfactor-in-memory-tables/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about partition migration, LIKE performance, best fillfactor, and in-memory tables. I'm Creston Jamison, and this is Scaling Postgres episode 124.

[00:00:23] I hope you, your family, and coworkers continue to do well. Our first piece of content is "Partitioning a large table without a long-running lock". This is from 2ndquadrant.com, and it's talking about an issue where you have a large table in a database that just continues to grow in size, and finally you realize, gee, I really should have partitioned this, or I need to partition it now. Typically, you would have to do this offline or somehow make the table unavailable while you do it, but this blog post describes a technique you can use to partition the table while keeping it online. The first thing they do is create a table they just call the original table, put some indexes on it, and insert about 100,000 rows into it, although here it says 10 million rows; well, some number of rows. Then they create the partition structure: a parent table (called the part table), partitioned by range, with four partition tables as the child tables. Now, how they keep the partitioned table and the original table in sync is with a trigger. Whenever there's an insert on the table, the trigger inserts the new data into the partitioned table. When there's a delete, it deletes the row from both the partitioned table and the original table. And for an update, handled in the else branch, it first deletes that row from the original table. If the row was found there, it inserts the new data into the partitioned table, since the row hadn't been migrated yet; if it wasn't found, the row already exists in the partitioned table, so it simply updates it there. So by using this trigger, everything can be kept in sync. To implement this, it all happens as part of one transaction.

[00:02:28] First, they rename the original table to an old table name, so this transaction should execute quickly. Then they alter the old table to disable its autovacuum settings. Next, they create a view with the original table's name that selects all the data from the old table and unions it with the data from the new partitioned table, so this view gives you the full state of the data across both the partitioned table and the original table. And then, with the function mentioned above, they create a trigger and apply it for each row to that view. Once this is done, things are kept in sync, and data will slowly move from the old table into the new partitioning structure. Then what you do is run an update on that table (essentially the view) for each ID that exists in the old table, which slowly migrates the data from the old non-partitioned table to the new partitioning scheme, and they have a Perl script that does that.
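To make the technique concrete, here is a minimal sketch of the trigger-and-view approach, assuming a hypothetical two-column table (id, value) and an already-created partitioned parent called part_table; the blog post's actual code is more complete:

    BEGIN;

    -- Rename the original table and put a view with its old name in its place
    ALTER TABLE original_table RENAME TO old_original_table;

    CREATE VIEW original_table AS
        SELECT id, value FROM old_original_table
        UNION ALL
        SELECT id, value FROM part_table;

    -- Route all writes against the view to the correct underlying table
    CREATE FUNCTION original_table_rw() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            INSERT INTO part_table (id, value) VALUES (NEW.id, NEW.value);
            RETURN NEW;
        ELSIF TG_OP = 'DELETE' THEN
            DELETE FROM old_original_table WHERE id = OLD.id;
            DELETE FROM part_table WHERE id = OLD.id;
            RETURN OLD;
        ELSE
            -- UPDATE: migrate the row on first touch, update in place afterwards
            DELETE FROM old_original_table WHERE id = OLD.id;
            IF FOUND THEN
                INSERT INTO part_table (id, value) VALUES (NEW.id, NEW.value);
            ELSE
                UPDATE part_table SET value = NEW.value WHERE id = OLD.id;
            END IF;
            RETURN NEW;
        END IF;
    END;
    $$;

    CREATE TRIGGER original_table_rw
        INSTEAD OF INSERT OR UPDATE OR DELETE ON original_table
        FOR EACH ROW EXECUTE FUNCTION original_table_rw();

    COMMIT;

    -- A no-op batch update then migrates existing rows through the trigger:
    UPDATE original_table SET id = id WHERE id BETWEEN 1 AND 10000;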
They did about 10,000 rows at a time to migrate the old data over, and once the old table is emptied, there are some cleanup commands: dropping the view and the function, altering sequence names, and renaming the table back to the original table name. So this is a really interesting technique, and I think I might actually be using it in the future. If you want to learn more about this technique, definitely check out this blog post from 2ndquadrant.com.

[00:04:13] The next piece of content is "PostgreSQL: more performance for LIKE and ILIKE statements". This is from cybertec-postgresql.com, and it's talking about LIKE performance. They create a table that is basically filled with hashes, about 50 million rows' worth, and then run a query looking for a particular string, knowing that only one row contains it, searching for it anywhere within the column. Searching for that one row out of 50 million took a little over 4 seconds, which is quite a long time. An explain plan shows it's doing a parallel sequential scan, so it has to scan through all 50 million rows to find it. So then the question is, how can you improve performance? What he did is use the pg_trgm extension, which breaks that column up into trigrams, and he shows how a value gets broken up into trigrams when you run the function against it. The next step is to index it. First he used a GiST index, but that was huge, 2.5 times the size of the actual table, and the performance was really bad, over a minute and 45 seconds. So searching through the GiST index took longer than just scanning the table; clearly you would not want to use a GiST index here. Most text searches that I've seen use GIN indexes, so that was the next thing to try. With a GIN index, searching for that one row using the LIKE query finished in 75 milliseconds, much faster than 4.7 seconds. But keep in mind that a GIN index is not that efficient when it comes to looking for an exact match on a column. If you also need the ability to do exact matches, you would want to add a B-tree index for the column as well, because with a B-tree index, pulling out a single row returns in less than one millisecond. It's just something to keep in mind when you're doing text searching: a lot of times the GIN index can be much more efficient for things of that nature, like full text search or even JSON data. So if you're interested in the performance improvements available for LIKE and ILIKE, definitely check out this blog post.
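As a rough sketch of what that looks like in practice (hypothetical table and column names, not the post's exact code):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- See how a value gets broken up into trigrams
    SELECT show_trgm('e2345ab9');

    -- A trigram GIN index makes unanchored LIKE searches indexable
    CREATE INDEX idx_t_hash_md5_trgm ON t_hash USING gin (md5 gin_trgm_ops);
    SELECT * FROM t_hash WHERE md5 LIKE '%e2345%';

    -- For exact matches, a plain B-tree index is still the fastest option
    CREATE INDEX idx_t_hash_md5 ON t_hash (md5);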
The next piece of content is "What is fillfactor and how does it affect PostgreSQL performance?". Fillfactor is how full you make a table or index page, leaving space on the page available for new data to be inserted. So when would that data be inserted? Basically when an update happens: you're reserving some of that space so updates can happen in place, as opposed to splitting off and creating a new page. By default, the fillfactor is 100%, so the question is, is it advantageous to lower it? If you never update a table, a fillfactor of 100% is basically where you want to be, but if you have a lot of updates, maybe you want to drop it. He talks about the test setup he used and the tests he ran, looking at fillfactors of 100, 90, 80 and 70. But because you are leaving part of each page empty when you decrease the fillfactor, you run into diminishing returns: your caching becomes less efficient, because those pages are cached, and if they're not as full, some of the performance drops. He looked at both transactions per second and the mean time to run the queries, and it looked like, in his tests, the sweet spot, at least for this sample size, was a fillfactor of 90%, and in some cases you could drop it to 80%. Basically, a slightly smaller fillfactor could give you a bit of a performance boost, and I believe he said, quote, "in average a 10% boost" when decreasing the fillfactor by 10 or 20%. So that seems to be about the sweet spot, but because of the diminishing returns, you want to take that into account when adjusting this. Adjusting this value seems very specific to your use case, so you definitely want to test it out, but in the tests he did here, a relatively minor fillfactor decrease seemed to work best for getting a bit of a performance boost. If you want to learn more and see all the details about the tests and the conclusions, definitely check out this post from cybertec-postgresql.com.
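For reference, fillfactor is an ordinary storage parameter, so experimenting with it looks something like this (a sketch with hypothetical names; the right value depends entirely on your workload):

    -- Set it when creating a table...
    CREATE TABLE t_updates (id bigint, payload text) WITH (fillfactor = 90);

    -- ...or change it on an existing table; only newly written pages honor
    -- the new setting, so rewrite the table to apply it throughout
    -- (note that VACUUM FULL takes an exclusive lock)
    ALTER TABLE t_updates SET (fillfactor = 90);
    VACUUM FULL t_updates;

    -- Indexes have their own fillfactor (B-tree indexes default to 90)
    CREATE INDEX idx_t_updates_id ON t_updates (id) WITH (fillfactor = 90);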
[00:08:47] The next piece of content is "Approaches to achieve in-memory table storage with PostgreSQL pluggable API". Now, there is no in-memory storage available for PostgreSQL yet, but this post talks about potentially working towards that, because the release of PostgreSQL 12 allows, quote, custom table storage access methods to be developed. If you've watched previous episodes of Scaling Postgres, we talked about zheap, a new storage infrastructure for storing data that could potentially eliminate the need for vacuum. Here they're using this same API to ask: can we store data in memory only? The post goes through their thinking and what kind of changes need to be made to develop a new storage method. They discuss a number of the open questions, as well as how to handle the buffer manager, because the buffer manager manages what's in memory and flushes it out to disk, and it's used to a great extent by the other table storage methods, such as zheap. But if you're going straight to memory, there's not really a reason to buffer it in memory; why would you buffer it in memory just to then save it to memory? So really you would want to go direct to memory, and the post is essentially thinking through how to bypass the buffer manager, or what makes the most sense. So this is an exploratory post on how they're developing it, basically a work in progress. If you're interested in finding out more, definitely check out this post from highgo.ca.

[00:10:25] The next piece of content is the follow-up to the webinar "Being Committed: A Review of Transaction Control Statements", parts one through three. This is from 2ndquadrant.com, and it talks about transaction control statements: transactions and persistence of data, how to handle application retries, and a little bit about transactions and visibility. I would say this was a basic presentation on transaction control statements, so if you're interested in that, go ahead and check out this webinar from 2ndquadrant.com.

[00:11:01] The next post is "Monitoring system activity with the new system_stats extension for PostgreSQL". This is from the enterprisedb.com blog, and we covered this extension in a previous episode of Scaling Postgres, posted, I believe, on the personal blog of Dave Page. This post again covers the extension in detail: basically, it exposes utilities that track CPU usage, disk usage, network activity, and things of that nature through the SQL interface. It gives you new system tables you can query, using these utilities, to get back detailed operating system information. If you're interested in learning more about this, how to install it, and what kind of OS statistics you can get from it, check out this blog post from enterprisedb.com.

[00:11:54] The next post, also from enterprisedb.com, is "Maintaining PostgreSQL is more than just a maintenance plan". This is a general post about common database maintenance tasks. They run through four here: vacuum analyze every week to keep your stats updated, reindex your heavily updated tables every month, plan for VACUUM FULL events when you need to, say, shrink tables, and monitor your connections and your load. They also have a framework for addressing maintenance on an ongoing basis: first examine what state your database is currently in and what system issues you are currently experiencing (are queries waiting on a lock? any kind of performance degradation?), and then look at how to approach and address those issues. So if you're looking for a bit of a framework for tackling ongoing maintenance of your PostgreSQL instance, you can check out this blog post from enterprisedb.com.

[00:12:54] The next piece of content is "Unicode normalization in PostgreSQL 13". Unicode normalization is basically converting the multiple different forms of a character into one consistent form. The post goes into more detail, but it's essentially normalizing characters that look the same into the exact same Unicode representation, and this capability becomes available in PostgreSQL 13. They show how you can determine whether a string IS NFC NORMALIZED or IS NFD NORMALIZED; NFC is the default, and they say most of the world essentially uses NFC. If you want to learn more about this normalization feature for Unicode in PostgreSQL, definitely check out this blog post from 2ndquadrant.com.
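In PostgreSQL 13 that looks roughly like this (assuming a UTF-8 database; the U& literals below spell out the letter a followed by a combining acute accent):

    -- The decomposed form is not NFC, but it is NFD
    SELECT U&'\0061\0301' IS NFC NORMALIZED;              -- false
    SELECT U&'\0061\0301' IS NFD NORMALIZED;              -- true

    -- normalize() converts to the requested form (NFC is the default)
    SELECT normalize(U&'\0061\0301', NFC) = U&'\00E1';    -- true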
[00:13:46] The next piece of content is "How to monitor PostgreSQL 12 performance with OmniDB - Part 2". This goes into building a performance monitoring dashboard for PostgreSQL. It's a very comprehensive, long blog post describing how to do it, along with a ton of graphs showing what the output looks like. If you're interested in building a monitoring solution using their OmniDB tool, check out this blog post from 2ndquadrant.com.

[00:14:17] The next post, also from 2ndquadrant.com, is "Barman 2.11: barman-cloud-restore and barman-cloud-wal-restore". This shows how, once you have a backup made with Barman, you can do the restore process, covering both the database and the WAL. If you're interested in using Barman for that purpose, definitely check out this blog post.

[00:14:39] The next piece of content is "Snapshot isolation mode". This is from pgsqlpgpool.blogspot.com, and this is a new feature coming in Pgpool-II version 4.2 that basically allows atomic visibility across multiple PostgreSQL instances. It looks like this is a feature they're building towards, where you essentially have globally atomic visibility across multiple database servers, and they go into some of the advantages and how they're thinking about it. Basically, this would enable atomic visibility no matter the version of PostgreSQL. So if you're interested in that, check out this blog post.

[00:15:18] And the last piece of content is that the PostgreSQL Person of the Week is Umair Shahid. If you want to learn more about Umair and his contributions to PostgreSQL, definitely check out this blog post.

[00:15:32] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 80

September 09, 2019 00:13:47

Just Upgrade, Vacuum Updates, Connection Strings, Postgres Showcase | Scaling Postgres 80

In this episode of Scaling Postgres, we discuss why you should just upgrade, vacuum updates in Postgres 12, psql connection strings and a Postgres...

Episode 67

June 09, 2019 00:09:49

PGCon, Tips, PostGIS, Strings | Scaling Postgres 67

In this episode of Scaling Postgres, we discuss videos from PGCon, Postgres tips & tricks, PostGIS parallel performance and using strings. To get the...

Episode 34

October 15, 2018 00:11:46

Inserts, Nearest Neighbor, Freezing, Comment | Scaling Postgres 34

In this episode of Scaling Postgres, we review articles covering all about inserts, how to handle nearest neighbor, freezing confirmation and Comment. To get...
