Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about HNSW indexes, vacuuming, bloat, watch me now and connections. I'm Creston Jamison and this is Scaling Postgres episode 281.
[00:00:23] Alright, I hope you, your friends, family and coworkers continue to do well. Well, before we start the content for this week, I did want to mention we've started making some changes, not necessarily to the show quite yet, but to the website. We're now generating transcripts of each of the episodes. I think we've done them for the last 20 episodes or so. It will probably take at least a week from a show's release before its transcript gets placed on the website, so there is a slight delay, but we have, I think, at least 20 or 25 now, and we're going to be working all the way back to episode one, generating these transcripts. So if you ever wanted to read the content as opposed to listening to it or watching the show, you can do that now as well. But the first piece of content is "pgvector 0.5.0: Faster semantic search with HNSW indexes". This is from supabase.com, and we already talked about the release of pgvector 0.5 and its HNSW indexes on a couple of previous episodes of Scaling Postgres, but there were two posts this week that had some additional good information to provide. This post from supabase.com goes into more depth about how HNSW indexes actually work. Again, the acronym stands for Hierarchical Navigable Small Worlds, and they give a representation of this. I'm not going to try to explain it, because I'm still trying to wrap my head around it as well, and I actually found the second post a little easier to interpret, so I'll leave this part of the post for you to look at. I'll focus instead on their performance results. They're starting with vectors that have 1536 dimensions, which is generally what OpenAI produces. When analyzing 224,000 vectors, they compared an IVFFlat index to an HNSW index, and with these tuning parameters, to get approximately the same or even higher accuracy, HNSW was three times faster than the IVFFlat index, so that's huge in terms of a performance gain. Then they tried it with a million vectors and actually saw a sixfold improvement in performance, which of course is even better, while maintaining that high accuracy rating. Now, this performance was so good it is even better than some of the results produced by Qdrant, which is a dedicated vector database. That seems pretty impressive to me. You can see the results produced by Qdrant and then pgvector, but again, you always have to be careful about comparing apples to oranges here. Still, these are impressive performance improvements. They also said that as they scaled the database up, performance scaled as you would expect; it maybe wasn't linear, but it was pretty good. They talk a little bit about the parameters they set for building the index, and generally higher values are better for higher dimensionality, but they make the index take a very long time to build. The other benefit they mention compared to IVFFlat is that you generally need to load all the data into the database before building IVFFlat indexes, because IVFFlat relies on the existing distribution of the data. You don't actually need to do that with HNSW indexes; as they say, quote, "you don't need to fully fill your table before building the index". So you can just add the index and start adding data, like you would frequently do with B-tree indexes or other index types.
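To make this a bit more concrete, here's a minimal sketch of what using pgvector's HNSW index type looks like. The table and column names are hypothetical examples for illustration, not taken from the Supabase post.

```sql
-- Assumes pgvector 0.5+ is installed; table and column names are hypothetical
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(1536)   -- OpenAI-style embedding dimensionality
);

-- Unlike IVFFlat, an HNSW index doesn't depend on the existing data
-- distribution, so it can be created before the table is fully loaded
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Approximate nearest-neighbor search by cosine distance
SELECT id, content
FROM documents
ORDER BY embedding <=> (SELECT embedding FROM documents WHERE id = 1)
LIMIT 10;
```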
But they also mentioned the improvements to the IVFFlat indexes; mostly this was with regard to build times and parallelization. They do note that, comparing HNSW versus IVFFlat, HNSW will always be slower to build and use more memory than IVFFlat. So if you're memory constrained, or you don't have a lot of time to build the index, you're probably going to need to stick with IVFFlat. And of course they mention this extension is now available in their Supabase offering. The next piece of content is "HNSW Indexes with Postgres and pgvector". This is from crunchydata.com, and they were saying that the IVFFlat indexes are basically list-based, whereas HNSW is a graph-based algorithm, so it's an entirely new way of doing approximate nearest neighbor searches. They say these indexes use approximation, so they're not precise; they're more performant than list-based indexes like IVFFlat; they require longer index build times, which we mentioned before; and they require more memory to use, because ideally you want the whole index to be in memory to get the most performant results. Now, I think this post did a better job of explaining what an HNSW index is and how it works. Basically, imagine you have a 2D representation with each point representing a vector in your data. There's a first level of the hierarchy that's very sparse, and you pick a point and find, I think they said, the ten nearest points. Once you have that, you then drop down to a denser level and get the ten closest related points to that. And then you do it again for an even denser level of the hierarchy. So apparently you're becoming more accurate as you go through the different hierarchies. But I do encourage you to read this; they have links to some of the research that was discussed, as well as the source code of how this actually works. So if you want to dig into the details, definitely use this as a reference. They talk about the advantages, basically that it is faster, usually with higher accuracy, with the trade-offs being the build time and needing a lot of memory to be able to use these indexes. And then they go into index tuning. There are two parameters you can change as you're building the index: m is the number of connections to the nearest neighbors or vertices, and ef_construction is the candidate list size during the index build, they say here. ef_construction should be twice the value of m, and the higher the values, the slower the index build. But again, if you're going to be using this, I definitely suggest you read this post more in depth to get a better understanding of it. There's also a query tuning parameter, ef_search, that you can adjust at query time, so not just at build time; this lets you adjust the size of the candidate list of nearest neighbors maintained during a search. And then lastly they go into a code sample using the Postgres AI tutorial they have on their site, where they go through the process of running a query, seeing a sequential scan, then applying an index and seeing the index scan. So if you're using AI and you want to start using pgvector and these new indexes, check out these two blog posts.
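Here's a rough sketch of those two build parameters and the query-time setting. The values shown are just pgvector's documented defaults, used for illustration; the table is the same hypothetical one as in the earlier sketch.

```sql
-- Build-time tuning: m (connections per node) and ef_construction
-- (candidate list size used while building the index)
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query-time tuning: hnsw.ef_search sets the candidate list size during
-- a search; higher values improve recall at the cost of query speed
SET hnsw.ef_search = 100;

EXPLAIN ANALYZE
SELECT id
FROM documents
ORDER BY embedding <=> (SELECT embedding FROM documents WHERE id = 1)
LIMIT 10;
```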
[00:07:00] Next piece of content: "Optimizing Postgres's auto vacuum for high-churn tables". This is from tembo.io, and they're talking about the need for vacuum, because when you do an update it's more like doing an insert and then a delete of the previous record, and a delete operation doesn't immediately remove the row; it marks it for deletion and it gets removed later. It's vacuum's job to go in and actually remove those rows. Now, as a consequence, you can end up with table bloat from all these unused rows in the table. So if you're doing a lot of updates, you're going to have a lot of dead rows in the table. They show an example of that here where they create a table and insert a million rows. I think they turned off autovacuum; yes, they did here. They ran a query and it completed in 191 milliseconds. They then updated the whole table, I think five times, ran the query again, and it took about 100 milliseconds longer to execute because it had to go through all these extra rows, it looks like about 50 million of them. But once you do a VACUUM FULL, all those dead tuples are gone, as they show here, and your performance gets back to normal, back to the 190 or so milliseconds. Now, other posts have covered this before, but this is where it gets interesting, because they have an extension they developed called PGMQ, and typically how this queuing system works is you insert data into it, then maybe you do an update or two to change the status, and then you delete that entry from the table. That's typical queue usage. So with that insert, update once or twice, delete pattern, you're going to have a lot of dead rows in that table relative to the number of live rows. Ideally you want to keep the queue at maybe zero rows, but if it's highly active, you're going to have tons of dead rows in that table. So they use this as an example, and they set the system configuration back to the defaults in terms of autovacuum settings. Then they did a pgbench run, and you can see it in this graph here. We'll just start where autovacuum runs: now you have max TPS and the lowest latency. The purple is latency, and the pink or red is transactions per second. Over time the transactions per second drop and the latency increases. Then autovacuum runs again, TPS is boosted back up to its max, latency drops back down to its minimum, and then it continues to get worse until autovacuum runs again. So this is the cycle you'll run into. Then they said, all right, let's change different configuration parameters to see if we can minimize this. The first thing they did was drop the vacuum cost delay down to ten milliseconds, so they halved it, and they increased the cost limit to basically its maximum. The cost limit is how much work can be done while vacuuming, and the cost delay is, once that limit is hit, how long vacuum should pause. So basically they allowed a lot more work to take place during a given sequence of work, with only a ten millisecond delay before vacuum gets to do work up to that limit again. With that, they did get a little bit of a TPS increase and a little bit of a latency reduction, but nothing too significant. The graph still looks relatively similar, but it does have some higher TPS counts and slightly lower latency. Next they wanted to see whether they should run autovacuum more frequently, so they dropped the vacuum scale factor to see what difference that makes, and it wasn't too much of a difference.
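If you want to see this kind of dead-row buildup on your own tables, here's a quick sketch of the sort of thing you can check. The table name is hypothetical, and these aren't queries taken from the Tembo post.

```sql
-- Dead-tuple counts and recent autovacuum activity for a high-churn table
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'my_queue';   -- hypothetical table name

-- The cost-based throttling settings the post experiments with
SHOW autovacuum_vacuum_cost_delay;
SHOW autovacuum_vacuum_cost_limit;
SHOW autovacuum_vacuum_scale_factor;
```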
Then they said, okay, let's cut the nap time in half. This is how long the autovacuum launcher waits between runs before it checks whether tables need vacuuming. With this you can see the vacuums are now happening twice as often, so the table definitely needed to be vacuumed, and that nap time was causing it to wait before doing that vacuum. So you definitely get more vacuum activity by doing that. The next thing they adjusted was the cost delay: instead of ten milliseconds they dropped it to zero, and there wasn't too much of a difference with that. Then they dropped the nap time down to 10 seconds as opposed to 30 seconds, and now you can see vacuum is running quite frequently and staying busy, but you see less of a TPS impact than in the previous graphs, as well as less of a latency impact. The only other thing that would have been interesting to find out is how the CPU usage changed. So yes, we're configuring autovacuum to run more aggressively, but what impact did that have on the overall system? Did the CPU utilization increase as a result of vacuum running more frequently? But I found this really interesting because you could see how changing individual configuration settings adjusted the latency and the TPS performance for the workload they were looking at. And if you want to learn more about it, definitely check out this blog post.
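For reference, here's roughly what adjusting these settings looks like. The values are only illustrative of the kind of changes the post tried, not recommendations, and you can also target a single high-churn table rather than the whole cluster.

```sql
-- Cluster-wide changes (require a configuration reload to take effect)
ALTER SYSTEM SET autovacuum_naptime = '10s';
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 10000;
SELECT pg_reload_conf();

-- Or set storage parameters on just the hot table (name is hypothetical)
ALTER TABLE my_queue
    SET (autovacuum_vacuum_scale_factor = 0.01,
         autovacuum_vacuum_cost_delay = 0);
```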
[00:11:35] Next piece of content: "PostgreSQL 17: Allow \watch to stop based on the number of rows returned". This is from dbi-services.com. \watch allows you to run a command repeatedly after a delay. For example, he runs a pg_stat_activity query here, and then he types \watch and it runs the command, by default, every 2 seconds, so he gets whatever the results are of the command being run again. And I should mention this is within psql, so this is using the psql client to do this. Now, historically you could add one more parameter to \watch, and that was to change the amount of delay. By default it's 2 seconds, but you could add a different number here and it would delay that many seconds. Again, this is great for DBA-related work. Sometimes I just set a query to run every 10 or 20 seconds to check the status of some job, maybe an index build or something like that. But as of psql 16, which might be released in the next week or so, in addition to the interval you can also specify a count. So if you only want it to run three times, as you can see here, it's going to run that query three times and then stop, or you can specify any count you want. But in 17, which we're probably a year away from, they added another parameter called min_rows. What this does is keep repeating the query until the number of rows returned falls below the value you set. This particular query is returning three rows, so he runs \watch min_rows=3; it keeps running as long as there are three or more rows, but once it drops to fewer than three rows, the watch stops. So these are some pretty cool features coming in 16 as well as 17, which unfortunately is about a year away.
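Here's a quick sketch of what this looks like inside psql; the monitoring query is just an example.

```sql
-- Run a query, then repeat it with \watch
SELECT pid, state, query
FROM pg_stat_activity
WHERE state <> 'idle';

-- Repeat every 10 seconds (the default interval is 2 seconds)
\watch 10

-- psql 16: stop after three executions
\watch interval=10 count=3

-- psql 17: stop once fewer than three rows are returned
\watch min_rows=3
```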
[00:13:23] The next piece of content is also related to this particular feature: depesz.com mentioned it in his blog post "Allow \watch queries to stop on minimum rows returned". The next post, also from depesz.com, is "Waiting for PostgreSQL 17: Generate new LOG for trust connections under log_connections". So basically, if you're logging connections, I guess it didn't record anything before when someone connected via trust, but now it will write to the log when someone connects using that method. That's great from a security perspective. The next post, also from depesz.com, is "Waiting for PostgreSQL 17: Add to_bin() and to_oct()". These functions allow you to convert numbers into their binary or octal representations. So if you want to learn more about these features, definitely check these blog posts out. Next piece of content: there was another episode of Postgres FM last week. This one was on connections, so they talk about the different ways you can connect and the options, security and performance trade-offs of each, primarily with regard to connecting over TCP/IP versus using Unix sockets. They also talked about some difficulty users have actually connecting to Postgres when they install it. If you want to listen to the show, you can click here or check out their YouTube channel here. Next piece of content: "Now it's PostgreSQL's turn to have a bogus CVE". This is from opensourcewatch.beehiiv.com, and the subtext for this is that PostgreSQL and curl aren't the only ones; someone is faking security alerts for numerous open source projects. Now, the next post I'm going to show references this CVE, where Postgres says it's not a security vulnerability. Apparently someone posted a vulnerability for Postgres 12.2 where you could create a denial of service by repeatedly reloading the database, but by virtue of the fact that you have to be a superuser and have the privileges to be able to do that, they don't really consider it a security vulnerability. And that's exactly what this post is talking about: someone apparently is posting bogus security vulnerabilities. This is something I haven't really heard of before, but if you want to learn more about that, definitely check out this blog post. There's also a companion post from Command Prompt, where they sent out a courtesy notification about this purported security vulnerability as well. Next piece of content: "PostgreSQL 16 improves infinity: PGSQL Phriday #012". This is from Rustproof Labs, and the feature he's most looking forward to in Postgres 16 is +Infinity. He's been able to use infinity with numerics, but now it can work with timestamps too, so he's super happy about that. Check this out if you want to learn more. And the last piece of content is "Exploring Postgres performance: a deep dive into pg_stat_statements". This is from stormatics.tech, and this is pretty much my favorite Postgres extension because it allows me to highly optimize my database. They go through all the different parameters you can set it up with, as well as all the different columns and what they mean in the pg_stat_statements system view that's provided, and they give you a few commands so you can find long-running queries, I/O-intensive queries, and even temporary-file I/O-intensive queries. So definitely check out this blog post if you want to learn more.
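To close out with something concrete, here's a small sketch covering the new to_bin/to_oct functions and the kind of pg_stat_statements query the last post describes. The exact queries in the post may differ, and pg_stat_statements needs to be listed in shared_preload_libraries to collect data.

```sql
-- PostgreSQL 17's new conversion functions
SELECT to_bin(42) AS binary, to_oct(42) AS octal;
-- binary = '101010', octal = '52'

-- pg_stat_statements: queries consuming the most total execution time
-- (column names are for PostgreSQL 13+; older versions use total_time)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       temp_blks_read + temp_blks_written AS temp_blocks
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```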
[00:16:47] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.