11K Faster Configuration Change | Scaling Postgres 305

Episode 305 | March 03, 2024 | 00:13:26


Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss one configuration change that resulted in an 11,000 times faster query, why Postgres is not using your index, backported PG improvements and parallelism with TOAST.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/305-11k-faster-configuration-change/

 


Episode Transcript

[00:00:00] A lot of times in my consulting I get a client asking, hey, is there some configuration change we can make so the database runs better? And usually my response is that configuration changes mostly give minor, incremental improvements. Maybe you can get something 20% or 50% faster, maybe a little more, and we're mostly talking about the system as a whole. But there is one configuration change, apart from maybe shared_buffers, that you don't want to get wrong, and that is random_page_cost. We're going to talk about it in the first two pieces of content this week. But I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "100x faster query in Aurora Postgres with a lower random_page_cost". This is from shayon.dev, and it looks like he was having a performance issue and suspected it might be random_page_cost, which was still set at its default of 4 in Aurora Postgres. So what is random_page_cost? It's basically the cost the planner assigns to a random page read from disk. It goes hand in hand with seq_page_cost, which is usually set at 1, so by default random page reads are considered four times as expensive as sequential reads. That was fine for magnetic drives, where a sequential read of the disk is much cheaper than random I/O, but it's no longer the case with the super fast SSDs we have today. Most of the recommendations I see are to drop random_page_cost down to about 1.1; that seems to be the sweet spot.

So how does this impact the planner? If random_page_cost is really expensive, the planner treats index scans as much more expensive relative to a sequential scan, four times as expensive in the 1 versus 4 case. If you bring the two values close together, 1 versus 1.1, it stops favoring sequential scans over index scans, and a lot of the time that leads to much better performance. So you can see it's doing a sequential scan here, and this particular query took 11 seconds to run, whereas just by making this one configuration change, setting random_page_cost to 1.1, he ran it again and it runs in 1.5 milliseconds. I was about to say that's 100 times faster, which is the name of the blog post, but it's actually around 11,000 times faster. So that one configuration change resulted in an 11,000 times faster query, because now it's doing an index only scan instead of a sequential scan. He even shows a graph of the system once the change was made and how quiet it became.

Now, I don't necessarily agree with something he mentioned down here, quote: "While it worked great in this case, it's worth making sure that you should only do this for queries where you're certain that an index exists. For instance, we could actually make things worse here if the index was broken or invalid." I wouldn't worry about that at all. The possibility of an index being broken is very low; if it's invalid, the planner won't use it until it's valid; and the planner knows what indexes exist, so if it can't find an appropriate one, it simply scans the table. There's not much to worry about there. But I would definitely say that if shared_buffers is the number one configuration value to change in Postgres, random_page_cost is probably the second.
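For reference, a minimal sketch of checking and lowering random_page_cost on a self-managed Postgres instance might look like the following (on Aurora or RDS you would change it through a DB parameter group instead; the table and query here are hypothetical, not the ones from the blog post):

```sql
-- Current planner cost settings (defaults: random_page_cost = 4, seq_page_cost = 1).
SHOW random_page_cost;
SHOW seq_page_cost;

-- Try the lower value in the current session first and compare plans.
SET random_page_cost = 1.1;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;  -- hypothetical query

-- If the plans improve, apply it cluster-wide for SSD-backed storage.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```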
Next piece of content: there was another episode of Postgres FM, and it's related to the first piece: "Why isn't Postgres using my index?" Here, Nikolai and Michael discuss several different ways that can happen. The first is that sometimes it's simply faster to read the table than to use an index. If you have a small table with not very many rows, it can be faster to just read the table and skip the index. Or if the query has to read a large portion of the data, it may be more efficient to read the table directly, because the index adds the extra work of looking everything up in the index and then going to the heap for the actual rows. There was also a whole discussion on random_page_cost and the importance of setting it appropriately; again, they suggested about 1.1 while leaving seq_page_cost at 1. Another reason an index might not be used is inaccurate statistics: maybe the planner doesn't have sufficient statistics to come up with the best query plan, so it doesn't pick the index. Another possibility is that you have an expression index but the predicate in your WHERE clause doesn't use the same expression, or you use a function in your WHERE clause but there's no matching expression index for it. Those are all cases where an index won't be used. I thought this was another great piece of content and I encourage you to check it out.
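To illustrate that last expression-index point, here's a minimal hypothetical sketch: the planner can only consider the index when the WHERE clause uses the same expression that was indexed.

```sql
-- Hypothetical table with an expression index on lower(email).
CREATE TABLE users (id bigint PRIMARY KEY, email text);
CREATE INDEX users_lower_email_idx ON users (lower(email));

-- This predicate doesn't match the indexed expression,
-- so the index can't be used for it.
EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';

-- This one matches lower(email), so the planner can use the index
-- (assuming the table is big enough and statistics are up to date).
EXPLAIN SELECT * FROM users WHERE lower(email) = 'a@example.com';
```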
[00:04:58] Next piece of content: "Quick benchmark: PostgreSQL 2024 Q1 release performance improvements". This is from credativ.de, and they're talking about the most recent point releases from February, which actually backported two performance improvements to version 15 as well. So if you're still on 15, you can get those benefits just by upgrading to the latest point release. The first is a scalability improvement under heavy contention. This is definitely a simplistic performance test: basically they ran SELECT txid_current() for five seconds while increasing the client count and measuring transactions per second. But the run shows version 15.5 of Postgres dropping off precipitously after 128 clients, with transactions per second falling significantly, whereas version 15.6, which has this improvement backported, maintains transactions per second much better out to what looks like about 1,500 clients. So that's great. The other improvement addresses JIT memory consumption: apparently there were bugs causing a memory leak with JIT inlining. It doesn't look like they found the root cause, but they essentially patched over it; after every hundred queries the LLVM caches are dropped and recreated, which resets things so memory doesn't keep growing. They show a chart of this test with a linear increase in memory for version 15.5, whereas it's highly stable in version 15.6. [00:06:46] So those are two great improvements that have been backported to 15. Check out this blog post if you want to learn more.

[00:06:53] Next piece of content: "TOAST and its influences on parallelism in PostgreSQL". This is from ongres.com, and they're talking about parallelism, where Postgres can do certain things in parallel, like parallel sequential scans or parallel index scans, numerous different operations where it spawns additional workers to do the work in parallel. There's a configuration option, min_parallel_table_scan_size, that helps determine, based on the size of the table, how many workers should do this parallel work. As a test, they set up a table, used EXTERNAL storage for their text column because they wanted it stored uncompressed in TOAST, and then put a very large value in that column to make the whole table about 5 GB. Then they ran a relatively simple query and two workers launched, because by default max_parallel_workers_per_gather is two. So they set it to three and ran the query again, but it still only used two workers. At a five gigabyte table size it should be using all three workers, so why wasn't it? Well, it wasn't taking the size of the TOAST into account: to the planner, the table was 45 megabytes, the size of the core table without TOAST. What they actually had to do was drop min_parallel_table_scan_size down to four megabytes so that a 45 megabyte table would qualify for three parallel workers. So the important point is that TOAST apparently isn't considered when the planner decides how many workers to use for a parallel operation. If you want to learn more, you can check out this blog post.

Next piece of content: "The default value of fdw_tuple_cost was updated to 0.2. What does that mean?" FDW stands for foreign data wrapper; it's used to contact external server resources. For example, there's a foreign data wrapper for Postgres, so you can talk to an external Postgres database and pull data from it, and the tuple cost is the cost of pulling tuples from that external system. Its default had been 0.01, and they wanted to move it to 0.2. Why? Well, cpu_tuple_cost, the cost of processing a single local row, is 0.01, and it seems crazy that pulling a row from an external server would cost the same as processing a row locally. So the default seemed really far off. Even the default parallel_tuple_cost is 0.1, ten times the cpu_tuple_cost, so for an external server they wanted to increase it further and made it 0.2. That's why the default changed: it basically brings it more into alignment with the other costs. Check out this blog post if you want to learn more.

Next piece of content: "First row of many similar ones". This is from sqlfordevs.com, and here they're discussing a scenario where you want to essentially group some data and keep just the top row of each group. You don't have to use GROUP BY for this; you can use window functions to do the same thing and choose the top rank, but Postgres also has a feature called DISTINCT ON that's not part of the SQL standard. In this scenario, you pull from orders where the year is 2024, order by the customer id and then by the price descending, and do a SELECT DISTINCT ON of that customer id. That returns all the columns of one row per customer id, and because of the descending sort, the row it keeps is the one with the highest price. So if you want to learn more about that, you can check out this blog post.
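A minimal sketch of that DISTINCT ON pattern might look like this (the orders columns here are hypothetical, not taken from the article):

```sql
-- One row per customer: their highest-priced order from 2024.
-- DISTINCT ON (customer_id) keeps the first row of each customer_id group,
-- and the ORDER BY makes that first row the one with the highest price.
SELECT DISTINCT ON (customer_id) *
FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'
ORDER BY customer_id, price DESC;
```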
Next piece of content: "Eleven lessons to learn when using NULLs in PostgreSQL". This is from dev.to in Francesco Tisiot's section, and he has eleven lessons all about NULLs: setting up your tables and whether columns should be NULL or NOT NULL, the defaults you typically need to set along with that, as well as how to query NULLs and how NULLs behave in computations. So if you want to learn all about NULLs, definitely check out this piece of content.

[00:11:19] Next piece of content: "Uncovering and exploring row constructors in PostgreSQL". This is from databaserookies.wordpress.com, and he's talking about the ROW constructor that lets you build a custom row in Postgres. It looks like he was using a lot of this for certain migrations he was working on, and he shows how you can expand these rows. So if you want to learn more about this, definitely check out this blog post.

Next piece of content: "Prepared statements for psql operations". This is from brandur.org. They occasionally needed to run a query that flags a team, varying only the flag name and the team id. So instead of copying the query and editing it each time, they decided to do it as a prepared statement: they PREPARE the statement, defining the parameters, and then EXECUTE it, passing in just the parameters, and they found that easier to work with. This wasn't really for performance; it was just a convenient way to, I guess, create an anonymous function of sorts. It's a use case I hadn't considered, so check it out if you want to learn more (there's a quick sketch of this pattern just after the transcript).

[00:12:31] Last piece of content: "The rest is history: investigations of WAL history files". This is from crunchydata.com. They're talking about WAL files and how different timelines work with the WAL as you promote databases, as well as reviewing the history files in the pg_wal directory and how they can help you trace the lineage of how a particular database came to be in its current state. So if you want to learn more about WAL history, you can definitely check out this blog post.

[00:13:01] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links for all the content mentioned, as well as an audio version and transcript of the show. While you're there, be sure to sign up for weekly notifications so you don't miss an episode. Thanks, and I'll see you next week. Bye.
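Picking up on the prepared-statements piece above, a rough sketch of that PREPARE/EXECUTE pattern in a psql session might look like this (the table, columns, and values are all hypothetical, not the ones from the post):

```sql
-- Define the statement once in the session, with two parameters.
PREPARE flag_team (text, bigint) AS
  UPDATE teams
  SET flags = array_append(flags, $1)
  WHERE id = $2;

-- Re-run it with different arguments instead of editing the query each time.
EXECUTE flag_team('new_billing', 123);
EXECUTE flag_team('new_billing', 456);

-- Prepared statements disappear when the session ends, or can be removed explicitly.
DEALLOCATE flag_team;
```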

Other Episodes

Episode 256

March 12, 2023 00:12:23

11TB WAL, pgec Writes, Patroni 3 & Citus, PostgREST | Scaling Postgres 256

In this episode of Scaling Postgres, we discuss WAL growing to 11 Terabytes, pgec supporting writes, Patroni 3.0 & Citus for high availability, and...


Episode 308

March 24, 2024 00:18:15

Scale Through Sharding | Scaling Postgres 308

In this episode of Scaling Postgres, we discuss high availability's future, a custom sharding solution by Figma, sharding pg_vector queries, and PG17 logical replication...


Episode 36

October 29, 2018 00:10:51

Sharding, Giving Back, SCRAM, Partitioning Evolution | Scaling Postgres 36

In this episode of Scaling Postgres, we review articles covering PG11 sharding, giving back, SCRAM authentication and the evolution of partitioning. To get the...
