Join Costs, zHeap, Autovacuum Tuning | Scaling Postgres 26

Episode 26 August 20, 2018 00:09:39

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering join costs, zHeap, autovacuum tuning and Postgres configuration for humans.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/26-join-costs-zheap-autovacuum-tuning/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about join costs, zHeap, autovacuum tuning, and human configuration. I'm Creston Jamison, and this is Scaling Postgres episode 26. [00:00:21] All right, our first article is Cost of a Join, and this is from brianlikespostgres.com. In it he is asking the question: when you have, say, a product table and you want to add a status for each of the items in that product table, what is the best way to do it? So he has an example here. One, do you want to add a status ID column to the product table and reference it in a new status table? Two, do you want to add a status ID column to the product table and let the application define what a status is by mapping it to something? Or three, do you just want to add a text column to that table that holds the exact value of the status? And what's the best option? With that, he's considering: what is the actual cost of a join? So he goes through a methodology of how he tested how expensive joins are. He basically takes a table and joins it to other tables, increasing the row count and increasing the number of tables joined to. And he goes through all of the code that he used to generate these tests, and then he gives graphs at the bottom of the post with what he found. Now, I'm going to skip down to the lower graph, just because I think it's the most relevant: he did some tests without indexes and then with indexes, and of course I would tend to go for the one that's using indexes to see what the cost of a join is doing it that way. Now, in his testing methodology he went up to a million rows, which is a little bit low from my perspective, particularly if you're looking to scale Postgres, because it's with a lot of rows that you start seeing an impact. What would have been more interesting to me is: at up to a billion or tens of billions of rows, how would that be impacted?
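The three options he's weighing might look something like this in SQL (table and column names here are hypothetical, just to illustrate the trade-off; only the first requires a join):

```sql
-- Option 1: status ID referencing a lookup table
CREATE TABLE status (
    id   int PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE product_v1 (
    id        bigint PRIMARY KEY,
    name      text NOT NULL,
    status_id int NOT NULL REFERENCES status (id)
);

-- Option 2: status ID whose meaning is defined in the application
CREATE TABLE product_v2 (
    id        bigint PRIMARY KEY,
    name      text NOT NULL,
    status_id int NOT NULL  -- e.g. 1 = active, 2 = discontinued, mapped in app code
);

-- Option 3: the status value stored directly as text
CREATE TABLE product_v3 (
    id     bigint PRIMARY KEY,
    name   text NOT NULL,
    status text NOT NULL
);

-- Only option 1 needs a join to show the status name:
SELECT p.name, s.name AS status
FROM product_v1 p
JOIN status s ON s.id = p.status_id;
```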
Secondly, he was only testing tables with, I believe, a maximum of three columns, or they were two or three columns, and they were integers. So the tables themselves didn't have a lot of data in each row, which makes it a little harder to compare to real-world applications, where you may be joining tables that have 50 or 100 columns in them and a wide variety of data types, not just fixed integers. But using his methodology, going from joining one table up to joining 50, he definitely saw an increase in the amount of time it took, again based upon the row count, but the performance was still good. And he says, quote, but still joining 50 tables with 1 million rows each happens in just twelve milliseconds, which is great. But again, with larger tables and with a larger number of rows, what would that performance look like? Because in my experience there is a real cost to joining multiple tables with a lot of rows, and you definitely need to minimize joining to additional tables. But another way I like to think about it when considering performance is: what am I asking the database server to do? So, for example, if I have three tables I need to join and each of them has a billion rows, that's a lot of data to churn through. But if I only want ten rows, then specific enough indexes on those tables will allow me to pull that data out very efficiently, so I'm not asking the database server to do that much. So when I think about the cost of a join, it's not necessarily that I need to avoid joins, but: what data do I need, what is the most efficient way to get it, and how can I ask for it from the database so it has to do less work? That will inevitably lead to great performance. That just tends to be how I think about it.
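That idea of asking the server for less work can be sketched like this (hypothetical tables; the point is the selective index, not the number of joins):

```sql
-- With a selective index, a join over very large tables can still be cheap,
-- because each lookup only touches a handful of index pages.
CREATE INDEX orders_customer_id_idx ON orders (customer_id);

EXPLAIN (ANALYZE)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.customer_id = 12345   -- only the few matching rows are fetched
LIMIT 10;
```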
But again, I like the methodology used in this post and I like the graphing that he did here, but it would be super interesting to see, for really large databases and more real-world table sizes, what that performance would look like. But definitely a blog post to check out. [00:04:18] The next post is actually a YouTube video, and it's very, very short. It's a little bit like a marketing video, and it's called Less Bloat, Fewer Writes, and Just Plain Smaller: The Promise of zHeap. This is from the EnterpriseDB YouTube channel. So basically zHeap is a way to redo the Postgres heap that EnterpriseDB is working on for PostgreSQL. And it's related to a post that was done back in January of 2018 called DO or UNDO, There Is No VACUUM. So essentially what zHeap is doing is trying to avoid having to vacuum unused rows. This goes back to how, when you do an insert in Postgres, it simply inserts a row; when you do a delete, it actually marks that row for deletion, and it's vacuum that goes through and cleans up those rows. When you do an update, it's basically like an insert and a delete: the new row version is inserted, the old row is marked for deletion, and then vacuum has to go in and clean it up. Now, this creates bloat problems and requires vacuum to go in and clean up these unused rows. But with zHeap they're wanting to handle multiversion concurrency control a different way, where they say, quote, we handle an update by moving the old row version to an undo log and putting the new row version in the place previously occupied by the old one. So it would do an update in place, but it would save the old version in an undo log. Now, for those of you who have used Oracle, this is very similar to their rollback segments concept.
So instead of doing an insert and marking the old row to be removed at some point via vacuum, you do an update in place, but you save what the old row was to another logging area. So if you need to roll back, you can consult that undo log to get that row back. Now, the promise of this, if it works, is basically no vacuum, which sounds really great to me, because I know that right now there is a big cost for updates, and generally if you want performance, focus on doing inserts only and try to avoid updates, particularly updating the same row multiple times; there's a real performance cost to that. But maybe this could alleviate some of that pressure. Now, they're projecting doing this by maybe around Postgres version 12, maybe 13; they're not quite sure yet, but definitely something to keep track of. And definitely two pieces of content, both the YouTube video and this blog post, to check out. [00:07:02] The next post is Tuning Autovacuum in PostgreSQL and Autovacuum Internals, and this is from the Percona.com blog. So basically it goes over what autovacuum is, why it's needed (and I kind of mentioned why it's needed), ways to set up logging for autovacuum, settings that determine when autovacuum runs, both globally and potentially on a per-table basis, how to identify the tables that need their autovacuum settings tuned, the number of worker processes, and how to control how often it runs and what impact it may have on your database server. So, definitely a good review post on how to set up and configure autovacuum for PostgreSQL. [00:07:46] The next post is actually another YouTube video, and this is called PostgreSQL Configuration for Humans by Álvaro Hernández, on the HighLoad YouTube channel. Now, this is from a presentation back in 2017, but he goes over most of the configuration values for PostgreSQL that you want to pay attention to.
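As a rough sketch of the trigger condition that kind of autovacuum tuning revolves around: a table is vacuumed once its dead tuples exceed autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times its live tuples. A query like this, assuming the default settings of 50 and 0.2 and ignoring any per-table overrides, approximates which tables are due:

```sql
-- Tables whose dead tuples exceed the default autovacuum trigger threshold.
-- Assumes defaults: autovacuum_vacuum_threshold = 50,
-- autovacuum_vacuum_scale_factor = 0.2; per-table overrides are ignored.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       50 + 0.2 * n_live_tup AS vacuum_threshold,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 50 + 0.2 * n_live_tup
ORDER BY n_dead_tup DESC;
```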
I also have links to the slides he used in the presentation that I'll share as well. So if you're wanting to learn a little bit more about Postgres configuration, definitely two pieces of content to check out. [00:08:21] The next post is actually super short: PgBouncer 1.9.0 has been released as of August 13. So you can check this out and review the change log to see if and when you want to upgrade to the next version. [00:08:37] The last post is Performance Monitoring and Auditing PostgreSQL - Top Resources, and this is from the Severalnines.com blog. This is basically a summary post that goes over each of the posts they've done related to monitoring Postgres performance. So they reference content that we've actually mentioned on this channel previously, like a performance cheat sheet, things to monitor, audit logging best practices, decoding PostgreSQL errors, the best alert and notification tools for PostgreSQL, et cetera. So definitely another blog post to check out. [00:09:16] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
