1,400 Times Faster Max & Group By Performance | Scaling Postgres 311

Episode 311 | April 14, 2024 | 00:18:52

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss a max and group by implementation that is 1,400 times faster, new quantization capabilities in pgvector, adaptive radix trees for faster vacuums, and splitting & merging partitions in PG17.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/311-max-group-by-performance/

 


Episode Transcript

[00:00:00] When it comes to benchmarking, I kind of think of it like the way to find an accountant. The best way is to ask an accountant, what is one plus one? And you don't want them to say two, you want them to say, what do you want it to equal? Because it can get really hard to get consistent results when benchmarking across different environments. As a case in point, I took a look at the blog post I'm showing right now. It gave two solutions to a problem, and one was about twice as fast, but when I tested it, the other one was twice as fast. So we'll take a look at that in this episode. But I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "Performance tuning max and group by". This is from cybertec-postgresql.com, and I always like their blog posts because they are super easy to rerun, given that they include all of the code they use in their examples. He's giving a scenario where you have some sort of time series data you want to track that just has a timestamp, an integer for a sensor id, and whatever measurement that sensor has taken, and then for 100 sensors he inserts 43 million rows. He wants to ask a pretty basic question: give me the highest value for each sensor.

Of course, the easiest way to do this is to select the sensor id and the max of the measurement from the table, group by the sensor id, and then order by the sensor id. Now, there's no index on it, so it does have to go through every row, which takes about 3.6 seconds. So he says, okay, let's put an index on it, and he puts a multi-column index on sensor id and measurement. He ran the query again, and while he doesn't show the timing, it did not lead to a better execution plan. In spite of the index, it was still having to essentially scan the whole index.

So basically he wants to do a skip scan. He breaks the problem down into: how long does it take to get the max measurement for a single sensor? The answer is that it can be retrieved rather quickly, in just over a tenth of a millisecond. So in theory, if we can put all 100 sensors together, we should be able to get it as fast as 1.4 milliseconds, or at least that's the target.

He had one implementation using a lateral join and another using a recursive CTE. In the lateral join method, he basically selects from a series of one to 100, one for each of the hundred sensors, and then does a lateral join that essentially runs the inner query once per sensor id. So it's running a query against the index that was created 100 times, like a for-each loop, as he shows here, and doing that got it down to five milliseconds. The other option was a recursive CTE, where he essentially grabs the smallest sensor id and then recursively grabs the next sensor id larger than the previous one until they're exhausted, and then for each sensor grabs the max measurement for that particular sensor. With this query, the explain plan is quite a bit longer, but it ran in 2.5 milliseconds. So essentially the recursive query was twice as fast as the lateral join.
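To make the two approaches concrete, here is a minimal sketch of the three queries involved; the table and column names are my own stand-ins, not copied from the article:

-- Hypothetical schema: 100 sensors, 43 million rows
CREATE TABLE sensor_data (
    tstamp      timestamptz,
    sensor_id   integer,
    measurement numeric
);
CREATE INDEX ON sensor_data (sensor_id, measurement);

-- Naive aggregate: has to visit every row (or the whole index)
SELECT sensor_id, max(measurement)
FROM sensor_data
GROUP BY sensor_id
ORDER BY sensor_id;

-- Lateral join: one small index lookup per sensor, like a for-each loop
SELECT s.sensor_id, m.max_measurement
FROM generate_series(1, 100) AS s (sensor_id)
CROSS JOIN LATERAL (
    SELECT max(measurement) AS max_measurement
    FROM sensor_data
    WHERE sensor_id = s.sensor_id
) AS m;

-- Recursive CTE: skip from one sensor_id to the next via the index,
-- then fetch the max measurement for each one
WITH RECURSIVE sensors AS (
    SELECT min(sensor_id) AS sensor_id FROM sensor_data
    UNION ALL
    SELECT (SELECT min(sensor_id)
            FROM sensor_data
            WHERE sensor_id > s.sensor_id)
    FROM sensors AS s
    WHERE s.sensor_id IS NOT NULL
)
SELECT sensor_id,
       (SELECT max(measurement)
        FROM sensor_data
        WHERE sensor_id = s.sensor_id) AS max_measurement
FROM sensors AS s
WHERE sensor_id IS NOT NULL;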
Now, I was replicating this because I was testing out some different things, but what I noticed is that, at least in a cached state, I was getting twice the performance for the lateral join versus the recursive CTE.

[00:03:39] Now, I'm using Postgres 16.2, and I'm not quite sure what version he was using, so it just goes to show you that even though we're running essentially the same code, maybe on different Postgres versions, you can get pretty significant differences. It may also be due to the fact that his were non-cached values, whereas I was looking at predominantly cached values, but I still find it interesting. And it goes to show that whatever your environment is, it pays to do empirical testing to validate that a given solution is the best one for your environment.

[00:04:13] Next piece of content: "Scalar and binary quantization for pgvector vector search and storage". This is from jkatz05.com, and Jonathan is talking about a new version of pgvector that's coming, 0.7, which will start to offer scalar quantization.

[00:04:32] What this does is drop the size of the dimensions down from the default four-byte float to a two-byte float, or maybe even a one-byte integer. Specifically, pgvector 0.7 will be introducing an option to index two-byte floats that they're calling halfvec, or a half vector. They're also offering binary quantization that can reduce each dimension to a single bit, which seems extraordinarily aggressive, but this will be offered in the upcoming pgvector as well. And they're adding support for sparse vectors via sparsevec, but that will be covered in a future blog post. Some of these features will allow indexed vectors to have more than 2,000 dimensions: in the case of halfvec you can store up to 4,000 dimensions, the bit data type can store up to 64,000 dimensions, and sparsevec, which will be discussed later, can store up to 1,000 non-zero elements.

For the rest of the post, he looks at the features coming in 0.7 and measures what the size difference is, the build time, the recall, as well as the query performance.

[00:05:50] He talks about his test setup and configuration. He's using a pretty large box here with local drives, not network-attached storage. He checks out halfvec first.

[00:06:02] Using these different measurements, he got anywhere from a 1.5 to a three-fold space reduction, and in terms of build times, anywhere from a nominal speedup to maybe two or, in some cases, three times as fast. Then with regard to recall, the values mostly appear identical, so the loss is negligible going from a four-byte vector to a two-byte half vector for the indexes. And you get just a little bit of a boost in performance as well, in terms of both queries per second and latency. So that seems like a great outcome. By using half vectors for your indexes, you're able to save on disk space, your index build times will be lower, you can maintain your level of recall, and your queries per second and even your latency improve slightly.
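As a rough illustration of the scalar quantization, here's a minimal sketch assuming pgvector 0.7 and a hypothetical table of 1,536-dimension embeddings; the table, column and operator class choices are mine, not taken from the post:

-- Hypothetical table of 1,536-dimension embeddings
CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    embedding vector(1536)
);

-- Scalar quantization: index the embeddings as two-byte halfvecs
-- via an expression index
CREATE INDEX ON items
    USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops);

-- Queries need the same expression for the index to be used;
-- $1 is the query vector
SELECT id
FROM items
ORDER BY embedding::halfvec(1536) <=> $1::halfvec(1536)
LIMIT 10;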
Next, he tried binary quantization, and he shows how you can create the expression index for HNSW indexes here. But he did say that if you're going to use this one, you're probably going to have to run a re-ranking query. If you look at what he's doing, he has an inner query that does the bitwise quantization and limits it to 800 rows, so a much wider range, and then from that set of data he does another nearest neighbor search to pull out the top ten. Because so much detail is lost, I kind of think of it like a BRIN index, where it only stores ranges and then you have to go get the detailed data. So you're casting a wide net with these binary quantized indexes, and then you do another nearest neighbor search using the whole vector to get down to your limit of ten.
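The re-ranking pattern he describes looks roughly like this. Again a sketch, assuming pgvector 0.7's binary_quantize function and Hamming distance operator, and reusing the hypothetical table from the previous sketch:

-- Binary quantization: index each embedding as a bit vector
CREATE INDEX ON items
    USING hnsw ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);

-- The inner query casts a wide net (800 candidates) using Hamming
-- distance on the bits; the outer query re-ranks with the full vector
SELECT id
FROM (
    SELECT id, embedding
    FROM items
    ORDER BY binary_quantize(embedding)::bit(1536) <~> binary_quantize($1::vector(1536))
    LIMIT 800
) AS candidates
ORDER BY embedding <=> $1::vector(1536)
LIMIT 10;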
So let's see how this performs, starting with the index space reduction and build speedup. With the first data set, the space reduction is maybe 2.6-fold, which is pretty good, but it gets as high as 20-fold for the Gist 960 Euclidean and 16-fold with the OpenAI 1000K angular. The build speedups across the data sets ranged from a little over one-fold to as high as 4.5-fold in some cases. So that's some pretty significant space savings for some of these.

But let's check out the performance, and by performance we mean the recall as well as the queries per second and the latency. Well, the picture is not so good for the recall: for the first two data sets it's 2% or 0%, and this is without doing the re-ranking. With the OpenAI 1000K angular set it gets up to 60% to 68%, which is okay, but not great. So I'm not even going to look at the query performance for those, because the recall is so low. By doing the re-ranking, you start to get a little better recall, up to 15% for the first data set, the Sift 128 Euclidean. The Gist 960 Euclidean is still at 0% recall, which is pretty bad, but the OpenAI 1000K set gets as high as 99.8% using the re-ranking query. In addition, you do get bumps in queries per second and latency, albeit less than a two-fold improvement. So binary quantization seems to depend a lot on the data set, because if you can get 99% recall with what I think was a 15- or 16-fold reduction in index size, that seems really significant to me.

Taking a look at his conclusions down here: the scalar quantization from four-byte to two-byte floats looks like a clear winner. You save index space and get minor improvements in queries per second and latency, with no change in recall. He says binary quantization works better for data sets that produce larger vectors that can be distinguished by their bits, so it's definitely data set dependent, and he suspects that at larger data set sizes you're probably going to see some degraded recall. But overall, quantization allows you to reduce your storage or memory footprint, which means you could either use smaller instances or potentially scale your workload further. If you work with AI, I definitely encourage you to check out this blog post to see what's coming in pgvector.

Next piece of content: "Waiting for Postgres 17: faster vacuum with adaptive radix trees". This is from pganalyze.com, and he's talking about an enhancement that will hopefully make it into 17, an implementation of adaptive radix trees, which can be useful for vacuum. When vacuum runs, it needs to identify all the dead tuples and remember their locations, because it also needs to delete the corresponding index entries. So it has to store those dead tuple ids before removing them from the indexes, and right now it has a memory limit of 1GB, with the current implementation just using an array. The radix tree implementation is much more memory efficient, so vacuum can remember far more dead tuple ids before it runs out of memory. And when it does run out of memory, vacuum must stop identifying dead tuples, go vacuum all the indexes, and then come back to vacuuming the table.

So if you have a lot of rows that need to be removed, you can get into a cycle of vacuum running, then running through all the indexes to remove the index entries, going back to vacuuming the table, hitting the memory limit, then going through a cycle of cleaning up the indexes again, and then back to vacuuming the main table. Essentially, as it says here, you get multiple index vacuum phases.

To see this in action, in the video and in the blog post, he did a test on Postgres 16 and 17. He created a table, I think it was 100 million rows, created an index on it, and then updated every value in it, so there's a lot of data that has to be vacuumed. I think he left autovacuum_work_mem at the default of 64 megabytes. On Postgres 16, as vacuum ran, it ended up hitting that limit nine times: you can see the index vacuum count was nine, meaning it hit the limit, had to stop, do the indexes, and then come back to vacuum the table again, nine times over. On Postgres 17, it only had to do the index phase once, and it looks like it used a maximum of 37 megabytes of memory. So not multiple rounds of 64 megabytes, but only 37 megabytes in total, and you can see the index vacuum count was just one. That's definitely more efficient. He says it also reduced the vacuum completion time from 773 seconds on Postgres 16 to 619 seconds on Postgres 17, and I imagine this could be an even greater win the more indexes you have on a table. So this is great news. I definitely hope it makes it into 17, and you can check out this piece of content if you want to learn more.

[00:13:27] Next piece of content: "PostgreSQL 17: split and merge partitions". This is from dbi-services.com, and apparently there's a patch for 17 that allows you to merge two partitions together or even split them apart. This applies to list and range partitioning; I don't think it will work for hash partitioning. He has an example here of list partitioning, where he has four partition tables with values in a, b, c and d. He then merges partitions one and two into a new partition he's calling partition 12, one and two together. You can see it has the values a and b, and he didn't even have to define those as part of the alter table command; he just said merge these two partitions together and it knew how to do it. I'm assuming in the case of range partitioning the ranges would need to be contiguous, but I don't know. Then, to split partitions apart, you can split them and define how you want them divided, as you can see in the command here, and it splits them back up the way they were before.

[00:14:30] So this is great. Unfortunately, the downside is that you need an ACCESS EXCLUSIVE lock on the parent table, so it appears everything is going to be locked while it rewrites the data. Because of that, I don't see a lot of use cases for this at this point in time. But if additional patches come out that reduce the locking requirement, much like indexes, which we can now create concurrently, hopefully in the future we'll be able to merge or split partitions concurrently. But check out this blog post if you want to learn more.
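For reference, the new commands look roughly like this. This is a sketch modeled on the post's list partitioning example, with my own table names, and since this is an in-progress patch the syntax could still change before release:

-- A list-partitioned table with one partition per value
CREATE TABLE tab (a text) PARTITION BY LIST (a);
CREATE TABLE tab_1 PARTITION OF tab FOR VALUES IN ('a');
CREATE TABLE tab_2 PARTITION OF tab FOR VALUES IN ('b');

-- Merge two partitions into one; the merged partition's value list
-- is inferred from the source partitions
ALTER TABLE tab MERGE PARTITIONS (tab_1, tab_2) INTO tab_12;

-- Split it back apart, defining how the values are divided
ALTER TABLE tab SPLIT PARTITION tab_12 INTO
    (PARTITION tab_1 FOR VALUES IN ('a'),
     PARTITION tab_2 FOR VALUES IN ('b'));

-- Note: both commands take an ACCESS EXCLUSIVE lock on the parent table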
Next piece of content: "Using common table expressions: transforming and analyzing data in PostgreSQL, part two". This is from redgate.com, and it's a follow-up to part one of this post, where they were talking about how, instead of thinking about extract, transform and load, maybe we ought to be thinking about extract, load, transform, meaning load the data into the database and transform it there. In part one, they took data from Advent of Code 2023, day seven, which involved hands of playing cards with different bids applied to them, and used all sorts of operations and functions to transform the data. In part two he says, all right, let's apply some common table expressions, or CTEs, to further transform the data. So if you're interested in learning more about SQL, especially CTEs and working with them, I definitely encourage you to check out this blog post.

Next piece of content: "Protect your PostgreSQL database with pg_tde: safe and secure". This is from percona.com, and they're basically announcing a new extension they call pg_tde, which stands for transparent data encryption. This allows you to encrypt the data in your tables transparently. Personally, I've always done application-based encryption, so the keys typically reside in the application, but if you have a larger organization or you want all the data locked down at the database level, a solution such as this could be a great tool to do that. So definitely check out this blog post if you want to learn more about that.

Next piece of content: "Building the best PostgreSQL GUI: announcing our acquisition of PopSQL", which I think is pronounced popsicle. This is from timescale.com, and apparently they acquired PopSQL, which, as it's stated down here, is a SQL editor for team collaboration. It definitely looks visually appealing, but apparently the significant feature is being able to edit SQL collaboratively, and as you can see here, they show two users editing something at the same time. I don't know how widespread that use case is; personally, I haven't had a need to do that, but maybe you do. Have you ever used PopSQL? I wonder. They're biased, of course, but they definitely see it as an improvement over something designed in the desktop era, such as pgAdmin or DBeaver. But if you're looking for a SQL GUI, maybe you want to check it out, particularly if you need the real-time collaboration component.

[00:17:44] Next piece of content: there was another episode of Postgres FM last week. This one was on "don't do this", so they basically covered the Postgres wiki page "Don't do this", which I think we covered on a Scaling Postgres episode a long while ago. They cover it in their episode here and give their opinions on it, so you can listen to the episode here or watch the YouTube video down here.

[00:18:09] Last piece of content: "Compiling PostgreSQL on macOS to test documentation and patches". This is from andyatkinson.com, and if you want to set up a Postgres development environment on your Mac, maybe you want to check out this post to see how he did it.

[00:18:25] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as, eventually, a full transcript. Thanks, and I'll see you next week.

Other Episodes

Episode 194

December 10, 2021 00:16:00

Go Faster, GIN Indexes, Collation Stability, PG14 & Beyond | Scaling Postgres 194

In this episode of Scaling Postgres, we discuss going slower to go faster, GIN indexes, collation stability and features of PG14 and beyond. To...

Episode 197

January 09, 2022 00:14:35

Explain Analyze Buffers, Healthier Postgres, 2021 Review, High Availability Aspirations | Scaling Postgres 197

In this episode of Scaling Postgres, we discuss including buffers with explain analyze, having a healthier Postgres DB, the events of 2021 and aspiring...

Episode 157

March 22, 2021 00:18:40

auto_explain Overhead, Postgres Data Lake, Citus Shard Rebalance, SSL Authentication | Scaling Postgres 157

In this episode of Scaling Postgres, we discuss auto_explain's overhead, setting up a Postgres data lake, rebalancing a Citus shared database and implementing SSL...
