Table Lock Explosion! | Scaling Postgres 391

Episode 391 November 09, 2025 00:19:15

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the issues of partition locking during planning, the problem of too much memory, the importance of a tsvector column and an upcoming sale on my course.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/391-table-lock-explosion/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/


Episode Transcript

[00:00:00] So whenever Postgres needs to run a query, it basically has two phases. The first phase is the planning phase, planning out how it's going to actually run this query. And then the second phase is execution, running the query. And you can see this if you use EXPLAIN ANALYZE on a query; it will show you how much time was spent in planning and how much time was spent doing the execution. Now, what's great about prepared statements is basically you can do a plan once, and then the subsequent runs of those queries are faster because the planning's already been done. It just has to do the execution. Now, when you add in partitioned tables, that makes planning even more difficult to do, because you have to do partition exclusion. You have to analyze potentially 1000 partitions, or however many you've created, to come up with a plan for a query. So one would think it's even more important, then, to have your partitioned queries preplanned. But in a series of blog posts this week, we'll see how the picture doesn't seem so rosy. Also, be sure to stay till the end for my consulting corner, where I'm actually going to talk about some upcoming Black Friday deals I'm planning for my course. But I hope you, your friends, family and coworkers continue to do well. Our first piece of content is the continuing series of the Postgres marathon, and this first post is "Prepared statements and partitioned table lock explosion, part one". This is from Postgres.AI, and this is Nikolay from Postgres.AI and Postgres FM. He's doing this marathon, so he's testing this out on Postgres 18. He did previous tests where he was tracking the different locks that would happen when a Postgres backend is coming up with the plan, and eventually it got down to just one or two locks during the execution. But here he creates a table with 12 partitions with three indexes each.
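As a rough sketch of the kind of setup he's testing (the table, column, and statement names here are my own placeholders, not taken from the post), you could reproduce it like this:

```sql
-- Hypothetical reproduction: a table with 12 hash partitions and three
-- indexes each (indexes created on the parent cascade to every partition).
CREATE TABLE events (id bigint, account_id int, payload text)
  PARTITION BY HASH (account_id);

DO $$
BEGIN
  FOR i IN 0..11 LOOP
    EXECUTE format('CREATE TABLE events_p%s PARTITION OF events
                    FOR VALUES WITH (MODULUS 12, REMAINDER %s)', i, i);
  END LOOP;
END $$;

CREATE INDEX ON events (id);
CREATE INDEX ON events (account_id);
CREATE INDEX ON events (payload);

-- A SQL-level prepared statement, executed repeatedly:
PREPARE q(int) AS SELECT * FROM events WHERE account_id = $1;
EXECUTE q(1);

-- While a transaction holding the statement is open, the relation-level
-- locks it has taken can be inspected from another session:
SELECT relation::regclass, mode
FROM pg_locks
WHERE locktype = 'relation';
```

Counting rows from that last query after each `EXECUTE` is essentially how the lock totals in the post are observed.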
He's using the default plan_cache_mode of auto, which basically does five custom plans, then builds the generic plan on the sixth execution, and then it starts using that afterwards. [00:02:09] So during the first five executions against this partitioned table, you get eight relation-level locks in total. [00:02:16] So the parent table gets locked, the three parent indexes, and then your target partition and its three partition indexes. So eight locks in total. But then when it's building its generic plan on the sixth execution, it locks every relation, so all tables, all indexes. And then he makes a very important point, quote, "Imagine what happens if we have 1000 partitions and who knows how many indexes." But then on the seventh execution it drops down to 13 relation-level locks. So the runtime partition pruning is working, but you're still locking all 12 partitions when you only need to scan one. Now, as a follow-on with part two, he delves into the 52 locks, where it basically needs to get all the matching partitions. [00:03:06] So all 12 partitions and the 36 indexes are opened and locked. And I think he basically sums it up here: we build a generic plan, but cannot prune at planning time without the parameter values, so the planner must consider all partitions, locking the 52 relations. [00:03:22] And then if we look at the next post related to this, part three, he takes a look at different plan_cache_mode settings. So he looks at the auto mode, which is what he did the testing on last time. Then he looked at forcing a generic plan, [00:03:37] as well as forcing a custom plan. [00:03:40] And here are the different results for it. So again, this is auto; this is usually the base case and how most people have their Postgres set up. On the first execution you get eight locks. Again, this is the parent, its three parent indexes, and then the partition table and its partition indexes. You get the explosion in execution six, where it needs to access all relations.
So if you have 1000 partitions, this will be huge. And then subsequent runs are either eight locks if it continues doing a custom plan, or 13 if it's doing a generic plan. When you force a generic plan, you get the explosion of locks in the beginning because it has to build the plan, which is what normally happens on execution six of auto, but then all subsequent executions do 13 locks, basically needing to lock all partitions. [00:04:30] Whereas with a forced custom plan, you get a consistent eight locks every time. And as he says here, consistent and explicit. And his conclusions down here are basically what I was mentioning in my intro: it's even more important to have good planning with your partitioned tables to avoid excessive planning time, but unfortunately, given the state of things now, you could run into really high locking issues. And of course, all of these different types of plans are related to how many partitions you have in your partitioned table. Now, he said people are working on this problem, and it's basically moving the runtime pruning a little bit earlier, but we'll have to see how that develops. So basically the question is what you should do. In terms of plan_cache_mode, if you're set at auto, you can expect a lock explosion on the sixth execution, and then it'll either accept or reject the generic plan; that's kind of the purpose of auto. You could force generic plans and have that one-time lock explosion at the beginning, but then you'd still have a fair number of locks during subsequent runs. Or you could force custom plans and have everything much more consistent, but as a consequence you're basically not using prepared statements at all, which could cause contention with regard to the lock manager. So it doesn't seem like there's a best choice; it kind of depends on what your workload is doing and what's the best way to address it.
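For reference, the three behaviors he compares are controlled by the plan_cache_mode parameter; the lock counts in the comments assume his 12-partition, three-indexes-each setup:

```sql
SET plan_cache_mode = 'auto';                -- default: 5 custom plans, then
                                             -- tries a generic plan (the lock
                                             -- explosion on execution 6)
SET plan_cache_mode = 'force_generic_plan';  -- explosion up front while the
                                             -- plan is built, then ~13 locks
                                             -- on every execution
SET plan_cache_mode = 'force_custom_plan';   -- replans every time, a steady
                                             -- 8 locks per execution
```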
Now, do keep in mind, he said, this analysis relates to SQL-level prepared statements, and most people are using libraries that do protocol-level prepared statements. Now, they'll probably act the same way, he says, but still it would require looking at their behavior separately. [00:06:18] But if you want to dig more into this, I definitely encourage you to check out these blog posts. [00:06:23] Next piece of content: "Don't give Postgres too much memory". This is from vondra.me, and he was analyzing a CREATE INDEX command on a system. He was looking at the number of parallel workers, 1, 2, 4 and 8, and you can see the duration in seconds here. And the more parallel workers, the faster it was, with diminishing returns once you get probably above 4. But what was odd is that when he increased maintenance_work_mem from 64 megabytes to 128 to 256, etc., up to 16 gigabytes, every time he increased it, things got slower. So giving Postgres more memory resulted in the index creation taking longer. [00:07:06] And the question is, why is this happening? [00:07:09] And the main reason I think he lists here is basically the L3 cache on CPUs, because the L3 cache is much faster than the main memory in the system. [00:07:20] So if you have some of these settings small enough to fit within the L3 cache of your system, it's going to perform better. And he says there may be some other reasons where processing data in smaller chunks may be better, because maybe the larger chunks put pressure on other parts of the system that may cause stalls to happen. And he talks a little bit about that. However, the thing that I keep thinking about is: was the CREATE INDEX the only thing running on the system? Because normally you're going to have a highly active system where you're creating an index. The probability that the memory needed for that index build is going to be in the L3 cache is going to be quite low.
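The knobs varied in that benchmark are real Postgres settings; a minimal sketch of the shape of the test (the table and index names are my own placeholders):

```sql
SET max_parallel_maintenance_workers = 4;  -- he tested 1, 2, 4 and 8
SET maintenance_work_mem = '256MB';        -- he tested 64MB up to 16GB

-- The timed operation: building an index on a large table.
CREATE INDEX idx_measurements_value ON measurements (value);
```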
So I wonder if in an actual production system, or at least a highly active one, you won't see this behavior. But definitely food for thought, and check it out if you're interested. [00:08:10] Next piece of content: "Do you really need tsvector column?" This is from depesz.com, and he's talking about how, when you're doing full text search, what a lot of people do is create a separate column that's converted to tsvector, and then query that when doing full text search. Now, you can always do an expression index converting the data to tsvector, but historically I've heard that's a little bit slower, and he actually ran some numbers to take a look at this. So he's searching through Wikipedia, looking at the title and the body of the content, and he created a tsvector field to store that in. He created a GiST index on the TSV column. He also did a GIN index a little bit further down below. And then he created an expression index to basically test the same thing, but to make it more readable, he actually created a dedicated function for doing the tsvector conversion and used that function when he created the index. [00:09:10] Now, he did the query comparison, and we'll just take a look at this table down here. The results are basically: if you want the best performance, you need a dedicated tsvector that's been materialized in the table. [00:09:23] Because using a GiST index with a small row count, the materialized tsvector was more than 10 times faster than trying to do it as a functional index. When looking at a GiST index with a larger row count, it was maybe around eight times faster. Looking at GIN indexes, the differences weren't as significant: the materialized tsvector column on the table was about 50% faster for a small row count and about twice as fast for a large row count. So not as significant, but definitely if you want the fastest performance, you're going to want to materialize that tsvector as a column in the table.
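A sketch of the two approaches being compared (table and column names are mine, not from the post; a stored generated column is one way to keep the materialized tsvector in sync):

```sql
-- Approach 1: a materialized tsvector column (fastest in his tests).
ALTER TABLE articles
  ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS
    (to_tsvector('english',
       coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
CREATE INDEX ON articles USING gin (tsv);

SELECT title FROM articles
WHERE tsv @@ websearch_to_tsquery('english', 'postgres performance');

-- Approach 2: an expression index, no extra storage, but slower in his tests.
CREATE INDEX ON articles
  USING gin (to_tsvector('english',
               coalesce(title, '') || ' ' || coalesce(body, '')));
```

The trade-off he lands on: the dedicated column wins on query speed but pays for it in table size.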
But also keep in mind, he mentions that it can take a huge amount of space, so that is a trade-off for doing that. But check this out if you want to learn more. Next piece of content: "PostgreSQL partition pruning: the role of function volatility". He's talking about how, when you're using partitioned tables, it's important to only select one or more partitions when you're doing a query, basically utilizing that partition pruning so you're not looking at every partition. [00:10:34] And if you're working with timestamps, and particularly functions to say, look at data that happened in the last three days, you want to be wary of the function you use and not use a volatile function. [00:10:46] So, for example, now() and statement_timestamp() are both stable, whereas clock_timestamp() is considered volatile because it returns the current timestamp exactly when it's run, whereas statement_timestamp() returns the same thing within a single statement, and now() returns the same thing throughout the same transaction. And as a consequence, if you try to use clock_timestamp() in one of these partition-pruning queries, it doesn't work and it basically scans all the partitions. So just something to be aware of when working with these types of functions and using them in partition pruning. [00:11:24] Next piece of content: ""You don't need Kafka, just use Postgres" considered harmful". This is from morling.dev, and this is a counterpoint to the first blog post I mentioned last week, titled "You don't need Kafka, just use Postgres". So he's kind of arguing against that, although as I started reading this, I got the feeling he maybe didn't read the full blog post. [00:11:49] But I would probably summarize his argument as: use tools designed for the job. He talks about Kafka and where it's best used; like, you shouldn't use Kafka for queuing, for example, or necessarily even Postgres.
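The function-volatility point above can be illustrated with a pair of queries, assuming a hypothetical table partitioned by a created_at timestamp:

```sql
-- now() is STABLE: constant throughout the transaction, so the planner
-- and executor can prune down to the partitions covering the last 3 days.
SELECT count(*) FROM events
WHERE created_at >= now() - interval '3 days';

-- clock_timestamp() is VOLATILE: its value can differ on every call,
-- so no pruning happens and every partition gets scanned.
SELECT count(*) FROM events
WHERE created_at >= clock_timestamp() - interval '3 days';
```

Running EXPLAIN on both versions makes the difference visible in the number of partitions appearing in the plan.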
But I will have to say, I'm using Postgres as a queue in my application and it's been working perfectly fine. Now, I don't have gangbusters volume, but it's good enough for now. Now, if you're interested in drama, I try to avoid it, but the person who wrote the original post actually went into the comments, and there's all sorts of discussion about this issue in general. So if you're interested in that, you can definitely check it out, although I didn't necessarily read all the comments through. [00:12:36] Next piece of content: "Producing UUIDs version 7 disguised as version 4 or version 8". This is from postgresql.verite.pro, and I've been so happy having UUIDv7 and having a timestamp-ordered UUID. [00:12:52] But he brings up an important issue: if you're actually sharing these externally, well, you could be exposing when they were created, and you may not necessarily want to do that. But he has a solution for it. Basically, before you send your UUID version 7 ID out somewhere, you actually encrypt the timestamp, basically converting it into what looks like a version 4 UUID before sending it. And he says you don't need to store this new version, because it can be converted back and forth between version 4 and version 7 just by knowing the encryption key you're using. [00:13:29] And he has an implementation of how to do that here, and he rolled it up into an extension as well. So if you're interested in that, you can definitely check out this blog post and extension. Next piece of content: "Beyond start and end: PostgreSQL range types". This is from boringsql.com, and this is another great comprehensive blog post, focusing on range types and how you would use them. [00:13:56] And one of the main use cases he's mentioning here is reservations; in this case, he's trying to reserve seats. So you want to reserve a particular seat ID with a range of dates.
Now, you can use independent date columns to do this, and you just want to make sure you add an exclusion constraint using a GiST index so you don't have overlaps for a given seat ID within a particular time range. [00:14:23] But it's so much easier, and as he says a cleaner schema, to use a range type. So instead of having two timestamp columns, you have a holding period column with a timestamp range. And even the GiST exclusion constraint is a lot easier to read as well. Then he continues and talks about the different types of boundaries, one inclusive, one exclusive, and how the default boundary is typically defined: basically inclusive at the beginning, exclusive at the end. He talked about the different set operations and operators you can use, how to handle infinity with range types as well as nulls, and even gave an example of integer ranges used to define discounts. So, for example, if you buy so many units of a product you would get a 5% discount, whereas at other quantities you would get a 10% discount. So maybe you would want to use ranges to define that. And then he also talks about multiranges, which are basically values holding multiple non-contiguous ranges. [00:15:25] So this is a really comprehensive blog post. If you're not that familiar with ranges, I definitely encourage you to check it out. [00:15:32] Next piece of content: "Alter egos: me, myself and cursor". This is from drunkdba.medium.com, and whenever I see "cursor" now I think of the AI IDE, but he's actually talking about a database cursor. [00:15:48] So he's running a CI/CD pipeline and he got a Postgres error: it cannot create an index on a particular table because it's being used by active queries in this session. He says, wait a minute, there's only one session, it's the same PID, it's the same transaction, there's no parallel runner, no second connection, and it's blocked. What is going on? So he did all this investigation, and finally he saw something related to cursors. He's like, oh boy.
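Going back to the range-types post, the seat-reservation schema he walks through might look something like this (a sketch with my own names; the btree_gist extension is needed so the GiST exclusion constraint can compare plain integers):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE reservations (
  seat_id        int NOT NULL,
  holding_period tstzrange NOT NULL,
  -- no two rows may share a seat_id with overlapping holding periods
  EXCLUDE USING gist (seat_id WITH =, holding_period WITH &&)
);

-- Default range bounds: inclusive start '[', exclusive end ')'.
INSERT INTO reservations VALUES
  (1, tstzrange('2025-11-10', '2025-11-12'));

-- An overlapping reservation for the same seat violates the constraint:
-- INSERT INTO reservations VALUES
--   (1, tstzrange('2025-11-11', '2025-11-13'));  -- fails
```

This is the "cleaner schema" he contrasts with the two-timestamp-column version.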
And basically there was an open cursor that someone had forgotten to close. He executed CLOSE ALL at the top of the migration file and everything worked. So someone had added an OPEN and then a FETCH, but they didn't CLOSE it. So he definitely says, if you're going to be using cursors, please use the FOR ... IN SELECT loop form so you don't have to remember to close it; it's closed automatically. And I found this quote pretty interesting: "In short, my session blocked my session. Peak self-sabotage." But check this out if you want to learn more. Next piece of content: "PostgreSQL 18 enables data checksums by default". [00:17:02] This is from credativ.de, and it is so easy for me to forget that Postgres does this by default now, which is great. So he talks about enabling these and how you can migrate to them. But personally, I think I would just do a logical replication upgrade. I mean, that's my plan when I'm ready to move up to 18: just do a logical replication upgrade, and then 18 will have checksums enabled automatically. But check that out if you want to learn more. [00:17:29] And the last piece of content: there was another episode of Postgres FM last week. This one was on gapless sequences. So they talk all about sequences and how they're not designed to be gapless, basically. So you're kind of swimming upstream if you're trying to do that, and you're better off implementing a different solution than actually using Postgres sequences. [00:17:52] But of course they talked about a whole host of other topics, so feel free to listen to the episode here or watch the YouTube video down here. And now it's time for my consulting corner. I wanted to let you know I am planning a very significant sale, and kind of a restructure, of my Ludicrous Speed Postgres course. [00:18:13] This sale will be taking place during the Black Friday time period, and just to let you know, this was recorded based upon Postgres 16. I haven't recorded anything or changed anything for Postgres 17.
There were some changes in it, but not significant. [00:18:27] I am going to be re-recording, maybe not the whole thing, but a significant portion, for Postgres 18, because I think there are a lot of changes that happened with that. But what's going on sale will be the Postgres 16 version, and my plan right now is that once the updated course is available, everyone who bought the old course will get the new course for free. [00:18:50] And as I have more details, I will let you know in the consulting corner leading up to Black Friday. [00:18:57] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.
