Postgres LLM OS & 30 Times Faster Index Builds | Scaling Postgres 301

Episode 301 February 04, 2024 00:18:14

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the Postgres.AI bot and the idea of an LLM OS for Postgres, 30 times faster index builds with pgvector 0.6.0's parallel HNSW builds, the PgBouncer 1.22 release, and plans for a new Postgres extension ecosystem.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/301-postgres-llm-os-30-times-faster-index-builds/

 


Episode Transcript

[00:00:00] Before we get started with the show this week, I wanted to let you know that I am planning a webinar. It's associated with a course, but the webinar itself is entirely free, and in it I'm going to cover the framework I use in my consulting work to identify performance opportunities. These webinars will be held over the next few weeks, and I welcome everyone to attend. The registration link will be the top link on the show page for this week's episode. Thanks. Now on to the show.

[00:00:34] I struggled this week to pick which post should actually be the top post. I kept flipping between two of them because both are very good: one is specifically focused on performance, the other has a lot of implications for the future. So really there are two best posts this week. But I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "postgres_ai bot: towards LLM OS for Postgres". This is from postgres.ai, and if you've used ChatGPT to ask questions about Postgres, this is the same thing but on steroids, and then it adds something more on top. So let's take a look. First, they're building their own bot. They're using GPT-4 Turbo, and they're explicitly limiting it to specific sources. Number one, the Postgres documentation. Two, the source code, so it can definitely read the comments; I don't know how well these models can actually extract knowledge from source code itself, but maybe eventually. And three, specific articles and blog posts he calls credible or from verified authors. I should say the person writing this is Nikolay, who also does the Postgres FM podcast. So this model focuses on high quality sources, and he even lists some of the articles and blog posts being used further down. The other thing they're doing is verifying what the bot states through database experiments, something Postgres.AI has been doing for a while with thin database cloning to very quickly copy databases and run experiments. So that's the other service beyond just having the chatbot available.

Next he shows the architecture of what they set up. You can see the data sources here; it looks like they're also including the mailing lists, as well as the results of DB experiments that have been run, and then the other pieces of the stack. He shows an example of a conversation with the chatbot and some of the results that were received, and he makes a point of listing the source material for everything, unlike ChatGPT today, where you have no idea where the knowledge comes from; he felt it was important to include the sources behind each response. He then goes into retrieval augmented generation, or RAG, which is basically how most of the data is processed and stored: he's using Postgres with pgvector, a custom crawler and scraper, and a vectorizer using the OpenAI API. So he's intending to create an LLM OS, a large language model operating system, dedicated to Postgres and to answering Postgres questions as accurately as possible. He says a few people have tried it and found better responses than what ChatGPT offers. But again, the other part he's offering is these database experiments, and I know personally that's what I love to do as well. Once I have an idea about something, I like to empirically test it to make sure it really is the best path to take. So I found this post super interesting. This is a paid service, and normally I don't make those the first post, but I found all of this content super interesting. It's also the case that you could build your own ChatGPT-style bot with your own data and implement something similar, because he lists all the sources being used, and at this point the knowledge behind the model comes from known, publicly available sources, although as they run more experiments there will probably be more proprietary data they can offer in their responses. But I thought this was super interesting and welcome you to check it out.
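To make the retrieval augmented generation approach described above a bit more concrete, here is a minimal sketch of the storage and retrieval side in SQL, assuming the pgvector extension is installed. The table, column names and the tiny three-dimensional vector are hypothetical placeholders; a real setup would use the embedding size of whatever model produces the vectors and would be populated by a crawler and vectorizer like the ones he describes.

CREATE EXTENSION IF NOT EXISTS vector;

-- hypothetical table holding chunks of documentation and blog posts
CREATE TABLE doc_chunks (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_url text NOT NULL,   -- kept so every answer can cite its source
    content text NOT NULL,      -- the chunk of text that was embedded
    embedding vector(3)         -- use your embedding model's real dimension here
);

-- retrieve the chunks closest to an already-embedded question (cosine distance)
SELECT source_url, content
FROM doc_chunks
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;

The retrieved chunks and their source URLs are what would then be passed to the model as context, which is how a bot like this can show where each response came from.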
Next piece of content: "pgvector 0.6.0: 30x faster with parallel index builds". This is from supabase.com, and as you can tell, 0.6.0 has been released. The main feature of this release, at least according to this blog post, is parallel builds for HNSW indexes. As a reminder, these are the indexes with super high recall, which to me means a higher probability that you're going to get back the values you're actually looking for, and they're much more easily updated than IVFFlat indexes. The problem was that they took a very long time to build compared to IVFFlat, but now that the build can run in parallel they're much faster. In the first example they show, on the same machine using the parameters m = 16 and ef_construction = 200, the build was nine times faster on version 0.6.0 compared to 0.5.1, and further down they show a few different parameter combinations where you can see 0.6.0 being much faster. It's also much more consistent compared to 0.5.1 with its non-parallel index builds. Now, the charts above were from a 16 core machine; when they went to 64 cores, it went from nine to 13.5 times faster. So it's definitely not linear scaling, but it still improved things further. You may be thinking, where is this 30x you're talking about? Well, it's for unlogged tables. When they tested unlogged tables against the 0.5.1 build time, although I don't know if that baseline was unlogged, that's when they got the 30x improvement for one configuration. So not everything is 30 times faster, but nothing in this blog post was less than nine times faster, so it's still a significant improvement with these parallel index builds. I don't really know why people would still be using IVFFlat after this; if you happen to know, let me know in the comments section.
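For reference, here is a rough sketch of what a parallel HNSW build looks like with the parameters mentioned above. The items table and embedding column are hypothetical; max_parallel_maintenance_workers is the standard Postgres setting that the 0.6.0 parallel build uses to decide how many workers it can launch, so raising it for the session is how you take advantage of extra cores.

-- allow more parallel workers for this index build (the leader participates as well)
SET max_parallel_maintenance_workers = 7;

-- HNSW index using cosine distance, with the parameters from the benchmark
CREATE INDEX ON items
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

Giving the build plenty of maintenance_work_mem also matters for large tables, since the graph is built in memory where possible.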
[00:06:31] Next piece of content: PgBouncer 1.22.0 has been released. This is from pgbouncer.org, and the main feature added is the ability for PgBouncer to respond to DISCARD ALL and DEALLOCATE ALL commands. Apparently some clients using PgBouncer's prepared statement support in transaction pooling mode needed to be able to send these commands. What DISCARD ALL does, as the Postgres documentation says, is discard session state, and in transaction pooling mode it can become important to reset the session state when you're using prepared statements. The other command, DEALLOCATE, deallocates a prepared statement, and DEALLOCATE ALL deallocates all of them. So if you're using PgBouncer and you want to use it with prepared statements, I would say you definitely want to upgrade to the most recent version.

[00:07:25] Next piece of content: I finally started doing my interviews about Postgres, and as I mentioned on the show before, I'm recording them on the Rubber Duck Dev Show and then sharing the content here. My first interview was with Michael Christofides of pgMustard and Postgres FM, and we talked about Postgres EXPLAIN. This is definitely a longer format, about an hour in length, so you can either watch the YouTube video recorded here or listen to the podcast down below.

[00:07:56] Next piece of content is actually a gist on GitHub talking about the Postgres extension ecosystem. Basically this is a jobs and tools post highlighting what they want this extension ecosystem to become. I believe this is David, who is working at Tembo now and who started the Postgres Extension Network (PGXN), and this is what they're looking to build. Here are the numerous jobs, the first being authoritative: basically be the official, canonical source of record for all Postgres extensions. They use examples like the Go module index for Go packages, the crates.io index for Rust, the PAUSE index for Perl (I wonder what that acronym means?), RubyGems for Ruby, which I'm very familiar with of course, and PyPI for Python packages. So basically they want to build an extension network matching what these do well, but the whole post lists about 20 different jobs. A few others: they want to make it beneficial for people to publish their extensions there, make it integrated so it's easier for developers to start writing and publishing extensions, make it easy to find new extensions, and it goes on and on. It's basically a requirements document for what they want to build. You can check this out if you're interested; I'm definitely going to keep my eye on this. Another post related to it is "PGXN Challenges", which David posted on justatheory.com. He talks about the challenge of converting PGXN into this new extension ecosystem and says there are a number of issues with it as it exists today. In addition, he talks about all the different languages being used: I think the site itself is written in Perl, but there's a client written in Python and developer tools in Ruby, which makes it hard for developers to come in and work on it because you have to be a polyglot. If you want more of his insight, you can definitely check out this blog post.

[00:10:03] Next piece of content: "PGSQL Phriday #015: UUID. Let's fight!" This is from mydbanotebook.org, and she's thrown down the challenge of writing about UUIDs for the next PGSQL Phriday. The posts need to be in by the Friday before this episode goes out, so we'll just have to see what comes out next week. She wants to know why you use UUIDs, problems you've encountered, what kind of UUID you use, whether you use them with sharding, whether you sort them, whether you use extensions related to them, etc. I will say I haven't really used a lot of UUIDs; I've basically only used them when JavaScript has generated a unique identifier for something and just sends it to the database. I haven't used them for primary keys in my own projects. I do have clients that have done it, and I will say it seems that projects using UUIDs for their primary keys have the largest amount of WAL generation I've ever seen, so that's something to keep in mind if you're thinking about using them. I'm assuming that will improve once UUIDv7 becomes available in Postgres, but we'll just have to see. There's also another post with the same announcement on the PGSQL Phriday website.
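If you want to try this out yourself, here is a minimal sketch comparing the two kinds of keys; the tables are hypothetical. gen_random_uuid() has been built into Postgres since version 13 and produces random version 4 UUIDs, which are exactly the kind of non-sequential values that tend to scatter writes across the primary key index, one of the reasons UUID-keyed projects can generate so much more WAL than a bigint identity column.

-- random v4 UUID primary key
CREATE TABLE events_uuid (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    payload jsonb
);

-- sequential bigint primary key for comparison; new rows append to the end of the index
CREATE TABLE events_bigint (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload jsonb
);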
Next piece of content: "Who Contributed to PostgreSQL Development in 2023?" This is from rhaas.blogspot.com, and here Robert posts the top contributors to Postgres for last year, sorted by, I believe, lines of code, with this list covering the principal author of each commit. He also has another list showing committers who did the most work but were not the principal author of particular commits, as well as the people who sent the most email to the pgsql-hackers list. As you can see, Tom Lane was at the top of all of these. Thank you to everyone who works on and contributes to PostgreSQL; I greatly appreciate it, I benefit highly from it, and I hope you enjoy these Scaling Postgres episodes because they are, I guess, part of my way of giving back.

[00:12:14] Next piece of content: there was another episode of Postgres FM last week. This one was on bloat, and here Michael was joined by Chelsea Dole, a staff software engineer and tech lead of the data storage team at Brex. They talked about bloat, how to identify it, how to resolve it, and the dangers of it. I guess the major danger is that if you get into a severely bloated state, autovacuum can't clean up tuples fast enough and your transaction ID age keeps growing; that's probably the big one. But also, if you're severely bloated, as they mention in the podcast, you're hurting your I/O, because there's all this dead space in your data, and when you pull pages from disk, whatever percentage of them is bloat holds no relevant information. In terms of resolving it, they mentioned basically optimizing your vacuum settings to help keep a lid on bloat. But if you really want to remove all of the bloat, you can of course do a VACUUM FULL, as long as you don't mind blocking reads and writes to the table, or, if you need to do it online, there is pg_repack. They did mention they've heard of people having issues with it; I know I tried it on a terabyte table in a test environment and it hung or didn't run to completion, so I haven't used it effectively and can definitely attest to having issues with it, although I know others have had success with it. My preference for getting rid of bloat is actually doing a logical replication upgrade, and you don't really need to do an upgrade: you can just create an entirely new database, set up a logical replica and fail over to it, and that will get rid of your bloat. Well, as they point out, not all of the bloat, but at least a significant part of it. So personally that's my go-to, but it was definitely an interesting episode and I encourage you to check it out.

Next piece of content: "What's Rows Removed by Filter in PostgreSQL query plans all about?" This is from andyatkinson.com, and he's talking about how, when you're looking at an EXPLAIN plan and your query has a filter on it, meaning it's filtering rows as it scans rather than resolving everything through an index, the plan tells you how many rows were removed by that filter. In this particular scenario, they're querying a table looking for a specific name code. There's no index on it, so it has to do a sequential scan, and there's no ORDER BY, just a LIMIT: look for this particular name code and then LIMIT 1. In this example he was looking at Rows Removed by Filter and it reported all but two rows being removed, yet there was only one row with this name code, so he was a little confused by that. What's happening is that when you ask for a specific name code with LIMIT 1 and no ordering, Postgres starts scanning through the table, and as soon as it sees that name code it grabs the row and stops, because all it needs is one row. It's not going to keep scanning the rest of the table, since that would be unnecessary work. So how many rows were removed depends on where the value you're looking for sits within the physical structure of the table, the heap. If you want to learn more, you can definitely check out this blog post.
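Here is a small sketch of the kind of query being described, with a hypothetical table and column, just to show where this line appears in the plan output.

-- no index on name_code, so Postgres has to seq scan and filter
EXPLAIN ANALYZE
SELECT *
FROM vehicles
WHERE name_code = 'ABC123'
LIMIT 1;

-- the scan node will contain something like:
--   Filter: (name_code = 'ABC123'::text)
--   Rows Removed by Filter: 123
-- where the count is however many rows were scanned and rejected
-- before the first match satisfied the LIMIT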
Next piece of content: "GROUP BY reordering". This is from blog.anayrat.info, and he's talking about a new commit added to Postgres that does what it says: GROUP BY reordering. The planner may choose to adjust the ordering of the GROUP BY columns if it thinks a different order will be more optimal. In the example he shows, an already optimized order returns in 3.5 seconds; he then switched the GROUP BY order compared to the previous query and it ran in about five and a half seconds, so it was slower. Clearly the optimum order is the first one. But when he applied the patch and enabled it, setting enable_group_by_reordering to on while keeping the bad order, Postgres optimized it and the query returned in 3.5 seconds, like the faster one. So this is a great improvement. What's interesting is that he says work on this patch started in 2018, and it took five and a half years to reach this commit, so that's a lot of consideration to make sure everything was in place for this enhancement. I definitely hope it gets into Postgres 17, but if you want to learn more, check out this blog post.

And the last piece of content: "PostgreSQL 17 tracks skipped rows from COPY in pg_stat_progress_copy". This is from dbi-services.com. I'm not as excited about this one as the save_error_to feature of COPY that we talked about in last week's Scaling Postgres episode, because pg_stat_progress_copy only maintains a row in that view while a COPY is actively running. For example, down here you can see something being processed, and it tells you how many tuples have been skipped. That's what this new feature does: it shows you how many tuples were skipped because of errors. But once the COPY is done, that row goes away and you lose that information, so it's only valuable if you're actively monitoring a running COPY and want to know whether errors happened. I definitely prefer the other feature that actually saves the rows that had errors; I think I would use that a lot more than this one, but it's great nonetheless.

I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links for all the content discussed, as well as a podcast version of the show and a full transcript. While you're there, you can also sign up to receive weekly email notifications of our shows. Thanks, and I'll see you next week.

Other Episodes

Episode 116

June 01, 2020 00:14:01

Observer Effect, Partition Management, Tuple Freezing, Hung Transactions | Scaling Postgres 116

In this episode of Scaling Postgres, we discuss the observer effect with explain analyze, partition management, tuple freezing and sources of hung transactions. To...

Episode 278

August 20, 2023 00:15:58

Squeeze Your System, One Million Connections, Indexing LIKE, pgvector HNSW | Scaling Postgres 278

  In this episode of Scaling Postgres, we discuss how to squeeze the most out of your database, achieving one million connections to Postgres, how...

Episode 180

August 29, 2021 00:16:29

Aggregate Filters, Insert-Only & Vacuum, Out-of-sync Sequences, Functional Indexes | Scaling Postgres 180

In this episode of Scaling Postgres, we discuss how to use aggregate filters, the importance of vacuum with insert-only tables, how to fix out-of-sync...
