Vacuum Speed Up, Faster Foreign Tables, Fast Queries With Union, Query Optimizer | Scaling Postgres 158

Episode 158 | March 28, 2021 | 00:19:48

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss a vacuum speed up and faster foreign tables in Postgres 14, running faster queries with UNION, and learning about the query optimizer.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/158-vacuum-speed-up-faster-foreign-tables-fast-queries-with-union-query-optimizer/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about a vacuum speedup, faster foreign tables, fast queries with UNION, and the query optimizer. I'm Creston Jamison, and this is Scaling Postgres, episode 158.

[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Speeding up recovery and vacuum in Postgres 14". This is from citusdata.com, and they're talking about an enhancement that was recently made to the upcoming Postgres 14, to be released in the fall. What they changed was the compactify_tuples function. That function is used in three different areas: when PostgreSQL starts up after a non-clean shutdown (basically crash recovery), during the recovery process that physical standby servers constantly run to replay WAL changes, and during vacuum. When they did some profiling, the specific problem area was high CPU load coming from replaying heap2 clean WAL records, and they tracked it down to this function. What does it do? They show an example of a heap page in Postgres: you have the page header and an array of item pointers that point to the actual tuples of data within the page. During update and delete operations, some of that space becomes unused, holding no-longer-needed tuples that eventually need to be vacuumed up. The three processes mentioned above compact the space within a page, and after compaction the tuples can end up out of order relative to the item array at the beginning of the page. In the existing compaction code, the part of compactify_tuples that was taking the longest was actually the qsort call within it, used to make sure nothing that's still needed gets overwritten during compaction. They re-engineered it so that a sort isn't needed at all: it keeps the items in an order that lets it always write without overwriting data it still needs, and if things do come out of order, it copies some items into memory to compact the page. The blog post goes into a lot more detail, but they also show some performance numbers. First, they checked crash recovery, because that seemed to give the most bang for the buck: with a pgbench run of 10 million rows, a bunch of updates, and then a forced crash of the system (therefore requiring recovery when it comes back up), the server came back up 2.4 times faster after the change, so more than twice as fast. But crash recovery doesn't happen all that often. What does happen a lot, of course, is vacuum, and vacuum performance changed as well: with this change, vacuum runs 25% faster. Anything that makes vacuum run faster is a win, so even though it's not over twice as fast like recovery, vacuum happens so much more frequently that this is the more important speedup, in my opinion.

[00:03:32] Of course, this also impacts the recovery process on replicas, which minimizes the chance that they fall behind the primary. So, definitely some great enhancements coming in Postgres 14, and if you want to find out more about the details, definitely check out this blog post.

[00:03:49] The next piece of content is "Faster bulk insertion to foreign tables - introduction to PostgreSQL 14 committed features", and this is from Postgresroad.Blogspot.com. This is talking about an enhancement to Postgres for inserting data through a foreign data wrapper. The post gives an overview of why they're working on this in particular: they're mainly focusing on the scale-out potential of Postgres, and getting the foreign data wrapper working efficiently will help Postgres scale out. Someone who was testing this was seeing poor performance, so they took a look, and basically the insertions were happening row by row, which is very inefficient, particularly if there's a network round trip to an external foreign server for every row. So they developed batching for the insertions, and they show some of the timing changes they saw. With a local table, without a foreign data wrapper, the insertions they tested completed in 6.1 seconds. Before their improvement, going through a foreign data wrapper took 125 seconds; after the improvement, insertion into the foreign table took 11.1 seconds. So it's still about twice as slow as a local table, but about eleven times faster than Postgres 13 and below can do it. Definitely a huge improvement, although there's still some impact compared to local table insertion. They do mention that the foreign server they're using is actually on the same machine, so there's no network latency, and you have to take that into account. They also split the target table into eight hash partitions, so presumably multiple tables they're pushing the data out to, and with that it was a little slower at 12.5 seconds compared to the 11.1 seconds before, though the local table case was slower with partitions as well. So, definitely some great performance improvements when doing inserts to foreign tables. The post also mentions work on using multiple CPUs, so maybe you could parallelize inserts across multiple different servers. Some more interesting changes coming in Postgres 14, and if you want to find out more, definitely check out this blog post.
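To make the feature a bit more concrete, here is a minimal sketch of how the new batching option for postgres_fdw can be used in Postgres 14. The server, table, and credential names are made up for illustration; batch_size can be set on the foreign server or on an individual foreign table.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Hypothetical remote server; batch_size controls how many rows are sent per insert batch.
CREATE SERVER remote_pg
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'remote.example.com', dbname 'appdb', batch_size '100');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER remote_pg
  OPTIONS (user 'app_user', password 'secret');

-- Hypothetical foreign table pointing at a table on the remote server.
CREATE FOREIGN TABLE events_remote (
  id      bigint,
  payload text
)
SERVER remote_pg
OPTIONS (schema_name 'public', table_name 'events');

-- With batch_size set, this bulk insert ships rows in batches of 100
-- instead of one network round trip per row.
INSERT INTO events_remote
SELECT g, 'event ' || g
FROM generate_series(1, 100000) AS g;
```

The option can also be added to an existing server with ALTER SERVER remote_pg OPTIONS (ADD batch_size '100').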
[00:06:14] The next piece of content is "Speeding up SQL queries by orders of magnitude using UNION". This is from Foxtown Systems. They had to develop a pretty comprehensive query to return some data, and they have what they call a diamond schema: there are multiple paths to query it because everything is interrelated, so essentially there's a circular way you could do the joins. When they pulled out one segment of the data they needed, the query ran in 1.5 milliseconds, so super fast. But then they joined to the other set of data they needed to bring in, using a fair number of ORs. Any time you have ORs, you can expect that performance probably won't be great, and sure enough, with that extra data added via the join, it finished in 3.2 seconds, so on the order of 3,000 times slower. Some of that was due to there being more data in some of the tables they joined to, but their idea was: maybe we can query each of these separately and then merge the results together, as opposed to using the ORs.

So basically they took the original query, which doesn't use any ORs and just pulls the exact data needed for one set of data (what they call the employee meal items), and then wrote another query to pull the customer meal items. They query them separately, without any ORs, and then merge the results using a UNION ALL. The original query ran in one millisecond, the second one for the customer meal items ran in 102 milliseconds, and putting them together with a UNION ALL, the combined query ran in 112 milliseconds. Pretty fast, and much faster than the more-than-three-second query. They have some conclusions in the post, but the way I look at it is this: ORs tend to slow down your queries, and one technique you can use is to address each case individually as a smaller, simpler query and then UNION or merge them together at the end. That can yield much better performance, as this example demonstrates. So if you want to learn more about this technique, definitely check out this blog post.
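As a generic sketch of the pattern being described (the table and column names are hypothetical, not the ones from the article), the rewrite replaces an OR across two join paths with two targeted queries merged by UNION ALL:

```sql
-- OR across two join paths: the planner often can't pick an efficient plan for this.
SELECT m.id, m.name
FROM meal_items m
LEFT JOIN employees e ON e.id = m.employee_id
LEFT JOIN customers c ON c.id = m.customer_id
WHERE e.region_id = 42
   OR c.region_id = 42;

-- The same data as two targeted queries merged with UNION ALL,
-- each of which can use its own index and join path.
SELECT m.id, m.name
FROM meal_items m
JOIN employees e ON e.id = m.employee_id
WHERE e.region_id = 42

UNION ALL

SELECT m.id, m.name
FROM meal_items m
JOIN customers c ON c.id = m.customer_id
WHERE c.region_id = 42;
```

Note that UNION ALL keeps duplicates, so if a row could match both branches you may need plain UNION, which deduplicates at some extra cost.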
[00:08:39] The next piece of content is "How the PostgreSQL query optimizer works". This is from Cybertec-postgresql.com, and here they go through the different techniques the optimizer uses to determine the best path for a query in Postgres. This is a very comprehensive blog post covering a variety of techniques used by the planner, and the way it's demonstrated is with example queries and EXPLAIN output, so you can see what the optimizer is actually doing. It demonstrates things like constant folding; function inlining; determining whether a function has to run once per row or can be applied across rows, based on whether it's volatile, stable, or immutable; subselect flattening; and the different types of joins. It's a very comprehensive post that I encourage you to read if you want to learn more about how the optimizer translates your SQL into the actual plan that's executed against Postgres, because it can help you get a sense of why Postgres is doing what it's doing. If you want to learn more, I highly encourage you to check out this blog post.

[00:09:57] The next piece of content is "How to check and resolve bloat in PostgreSQL". This is from Highgo.ca. The first thing they cover is why bloat happens: with Postgres's multiversion concurrency control, every update actually happens as an insert, and the old row version eventually isn't needed anymore and can be vacuumed up. The same goes for deletes: the row isn't physically removed right away, it's just marked as no longer visible, and the space can eventually be reclaimed. But that vacuum process doesn't return disk space to the operating system, so you end up with bloat when space can't be reused. Now, how do you detect it? There are a few techniques, but they're not very accurate using just base Postgres; the pgstattuple extension probably does the best job of determining how bloated your tables are. They ran a test and found that a fresh table showed 0% bloat, whereas after a delete process the bloat was at 30%. Similarly, indexes can also be bloated, and they show that by checking an index.
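If you want to run this kind of check yourself, here's a minimal sketch using pgstattuple, which ships with Postgres as a contrib extension. The table and index names are hypothetical.

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Table-level stats: dead_tuple_percent and free_percent are good indicators of table bloat.
SELECT tuple_count, dead_tuple_count, dead_tuple_percent, free_percent
FROM pgstattuple('public.orders');

-- Index-level stats: low avg_leaf_density and high leaf_fragmentation suggest index bloat.
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('public.orders_pkey');
```

Keep in mind that pgstattuple scans the whole relation, so on very large tables the cheaper pgstattuple_approx function is an alternative.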
So the question becomes: once you've identified bloat, how do you deal with it? The first option they discuss is VACUUM FULL. This essentially rewrites the whole table; the disadvantage is that you cannot read from or write to the table while the VACUUM FULL is running, which makes it really hard to use in practice. CLUSTER is similar in that it rewrites the table, but it also organizes the data on disk to match the order of an index: you cluster on a particular index and the data is ordered that way. But much like VACUUM FULL, you cannot read from or write to the table while it runs. pg_repack is a separate extension that does online rewriting of tables and indexes, so the table can stay in use while pg_repack works. The downsides are that it can take a fairly long time to run depending on your table size, and it needs roughly two to three times your table's disk space to do the rewrite, so if you're short on disk space it unfortunately isn't really an option. The other option is reindexing, which only addresses indexes: on recent versions of Postgres you can use REINDEX CONCURRENTLY quite easily, and that essentially rewrites the index. Of course, it doesn't do anything about table bloat. If you want to learn more, you can check out this post.

[00:12:20] The next piece of content is "Waiting for PostgreSQL 14: Allow configurable LZ4 TOAST compression". By default, Postgres compresses TOAST data. TOAST is where Postgres stores values that exceed the space available within a tuple, so it's a separate storage area, and what gets stored there is compressed. By default it uses Postgres's own implementation of an LZ-family algorithm, but with this patch in Postgres 14, they've added the option to use LZ4. You need to do a few things to enable it, particularly building Postgres with the --with-lz4 option, so you need to know ahead of time that you want LZ4 compression available. The advantage is that it was about twice as fast as the native compression, so if you're looking for more speed when compressing your TOAST tables, you may want to check out this new compression option coming in Postgres 14. If you want to learn more, definitely check out this blog post.
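As a quick sketch of how the new option is used, assuming a Postgres 14 server built with LZ4 support (the table and column names here are hypothetical):

```sql
-- Choose LZ4 compression for a column at table creation time.
CREATE TABLE documents (
  id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  body text COMPRESSION lz4
);

-- Or switch an existing column; only newly stored values use the new method.
ALTER TABLE documents ALTER COLUMN body SET COMPRESSION lz4;

-- Or change the default for newly created columns (can also go in postgresql.conf).
SET default_toast_compression = 'lz4';

-- pg_column_compression() reports which method was used for each stored value.
SELECT id, pg_column_compression(body) FROM documents LIMIT 10;
```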
[00:13:20] The next piece of content is "Sharding Postgres on a single Citus node: how, why and when". This is talking about the new Citus 10, which is open source: you can now create distributed tables within a single node. Then of course the question is, why would you want to do that? This post lists four reasons. One is query parallelization for multi-shard queries; you can get query parallelization by setting up a distributed table on a single node, although Postgres already has parallel query features of its own, so I don't know how advantageous this is compared to those, and they don't discuss that here. The second reason is smaller indexes to create and maintain. You can definitely get that this way, but you can also get it with partitioned tables if you happen to want to partition the data.

[00:14:11] Third is smaller tables for autovacuum to process in parallel; again, partitioning can give you that, so I don't know how big an advantage this is compared to partitioned tables. And fourth is faster bulk data loads, which you could also get with partitioned tables. So a lot of the reasons given here you can get with native Postgres otherwise, which leaves the reason that you eventually want to scale out. If you envision your database growing very fast and needing to scale out, then at that point it makes sense to go with Citus, it seems. They then cover how you would set it up on a single node and, once you're ready, how to bring up worker nodes to send those sharded tables to when you want to scale out. So if you want to learn more, you can check out this blog post from citusdata.com.

[00:15:02] The next piece of content is actually a YouTube channel. This is EDB, and in the last week they've posted a number of Postgres webinars, each about an hour in length. So if you want some Postgres video content, definitely check out this YouTube channel.

[00:15:20] The next piece of content is "Explaining your Postgres query performance". This is from Crunchydata.com, and they're talking about EXPLAIN and EXPLAIN ANALYZE for analyzing your queries to make them faster. This is a follow-up to an earlier post about using pg_stat_statements to identify the statements taking the most time in your database: once you've identified them, you want to look at each statement with EXPLAIN ANALYZE to work out how to optimize it so it runs faster and uses fewer resources. They walk through how to interpret the output of EXPLAIN ANALYZE to help you optimize your queries, so you can check out this blog post.
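A rough sketch of that workflow, assuming the pg_stat_statements extension is installed and preloaded; the column names shown are the Postgres 13+ names (earlier versions use total_time and mean_time), and the query being explained is just a hypothetical example:

```sql
-- Find the statements consuming the most total execution time.
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Then look at an individual offender with EXPLAIN ANALYZE to see the actual plan,
-- estimated versus actual row counts, and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.region_id = 42
ORDER BY o.created_at DESC
LIMIT 50;
```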
[00:15:58] The next piece of content is "Ansible collection for PostgreSQL and EDB components". This is from Enterprisedb.com. They've set up an Ansible collection they're calling EDB Postgres, with a link to it on GitHub, and it defines a whole set of Ansible roles for setting up the different components of a Postgres infrastructure. I believe they're targeting Red Hat Enterprise Linux for this, and in the future they intend to add Debian and Ubuntu support and offer more backup and recovery options such as pgBackRest. So if you want to use these playbooks and roles, or even just get a sense of how they've chosen to set things up, you can check out this open source project.

[00:16:47] The next piece of content is "No space left on device". This is from mydbanotebook.org, and this post talks about what happens when you run out of space on your database server. The first thing is to add more disk or remove any unnecessary files. A very important point they make here: whatever you do, don't touch anything inside the Postgres data directory, especially the directory called pg_xlog in versions prior to 10, or pg_wal in version 10 and higher. Once you've freed up space, there are some other things to keep in mind. With the disk running out of space, were any WAL files corrupted? You can check the log for messages to see if that's potentially the case, and they cover ways to resolve WAL corruption. Ideally you restore from a backup, but failing that you can look at the pg_resetwal tool, though you need to be very careful with it. As they say, quote, "I strongly encourage you to read the whole documentation page of it", and I definitely agree with that. You can also check for data corruption using pg_checksums, and they include a quick example of what the logs show once the space issue is relieved and the database comes back online. So you can check out this blog post if you're interested.

[00:18:06] The next piece of content is "Performance improvements in GEOS". This is from Crunchydata.com. GEOS is a library that PostGIS uses for the, quote, "hard computational geometry functionality" such as intersections, unions, differences, and buffers of geometry. They show, from version 3.6 up to the mainline, the performance improvements they've made to different capabilities of the GEOS library. It's a great example of how they're driving forward the performance of this library to help PostGIS in general. So if you're interested in that, you can check out this post.

[00:18:47] The next piece of content is "How to get the status of a Cloud Native PostgreSQL cluster". This is from Enterprisedb.com, and they're talking about how you can use kubectl, the Kubernetes command-line tool, to get the status of your Postgres cluster. So if you're using Kubernetes with Postgres, maybe you want to check out this blog post.

[00:19:12] And the last piece of content: the PostgreSQL Person of the Week is Julian Ryu. If you're interested in learning more about Julian and his contributions to Postgres, definitely check out this blog post.

[00:19:24] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 118

June 14, 2020 00:15:51

Safety Systems, Failover Slots, Transaction ID Complications, Repartitioning | Scaling Postgres 118

In this episode of Scaling Postgres, we discuss safety systems, the purpose and existence of failover slots, complications with transaction IDs and how to...

Episode 117

June 08, 2020 00:17:35

Index Deduplication, Fast Hierarchy, More Monitoring, Cloud-Native | Scaling Postgres 117

In this episode of Scaling Postgres, we discuss index deduplication in Postgres 13, fast hierarchical access, more essential monitoring and cloud native Postgres. To...

Episode 212

April 24, 2022 00:21:23

Slow Queries, 4TB Upgrade, Postgres on ZFS, Storage Stampede | Scaling Postgres 212

In this episode of Scaling Postgres, we discuss dealing with slow queries, upgrading a 4TB database, running Postgres on ZFS and handling a storage...
