DB LLM Agents? | Scaling Postgres 382

Episode 382 | September 07, 2025 | 00:19:03

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss using an LLM as an agent, the importance of partitioned table statistics, PG 18 RC1, primary keys in sharded databases and a blue/green rollback.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/382-db-llm-agents/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/


Episode Transcript

[00:00:00] I know AI and large language models are all the rage now, and I have started using them for some programming work I'm doing. I've even done DBA-related work with them. But I still have a huge problem trusting them with anything important. For example, when I'm programming I always have the code committed first. Then I say, okay, go ahead and try figuring out this problem or coding this feature, and if they go down a bad path, I simply restore the files and try again.
[00:00:33] And that's the thing that's different from most programs that have been written, where they did the same thing every time and you knew they wouldn't vary from that path unless you installed a new library or something like that that had different behavior. They always did the exact same thing, so it was quite easy to automate things.
[00:00:52] But these LLMs act, frankly, more like humans, and from time to time they may choose to do something different from what you instructed them to do. So for repeated processes, to my thinking, it's really hard to rely on them, or you have to tightly define their role to make sure that them going off the rails doesn't cause more problems. But still, people are looking at AI and LLMs as a way to build what they call DBA agents, and we'll talk about that more today. Also, be sure to stay till the end for my consulting corner, where I talk about some progress we've made with regard to upgrade rollbacks. But I hope you, your friends, family and coworkers continue to do well.
Our first piece of content is "From DBA to DB agents". This is from xata.io, and apparently Xata has been building an LLM-based Postgres agent, and this blog post is about a presentation you can find at this link. The post goes over the author's history and what the typical responsibilities of DBA work look like, in terms of installing and configuring Postgres, monitoring, tuning queries, doing upgrades, migrations, things of that nature, how that work has gone through different cycles of evolution, and then how an LLM agent could potentially help.
[00:02:17] They've tried to look at it from the perspective of playbooks. So they have a general monitoring playbook, a list of tasks or duties the agent should carry out; an investigate-slow-queries playbook; playbooks for investigating high CPU usage, low memory or a high connection count; as well as a tune-settings playbook and a daily-summary one. Just a set of playbooks and guidelines to work through to try and diagnose what the problem is. Now, to keep guardrails on the LLMs, when it comes to making changes they're talking about approval workflows through GitHub: the agent wouldn't actually make a change, it would open a pull request and a human could choose whether or not to merge it. They're also talking about custom workflows through MCP servers, and about safer operations by introducing guardrails and gated workflows. Now, they have a headline here, "where AI shines". They say: explaining why a query is slow in natural language. Yes, that's definitely advantageous. Suggesting next steps for developers to take. I can definitely see that. And supporting devs with SQL guidance, rewriting queries, better index usage, or clarifying confusing error messages. Although, with a lot of this, do we really need a DBA agent? Because I can do all of this from a ChatGPT prompt, a Gemini prompt, or even Claude, as long as it has insight into the schema, which it should be able to get.
[00:03:38] And in terms of where it struggles: multi-step schema changes, managing costs when the LLM queries add up, and consistent behavior is also a challenge. Moving forward with these AI agents, the biggest barriers they're suggesting are privacy, because yes, they would need access to a lot of different data, and given that LLM behavior is inconsistent, we have to be careful how much we trust them. And then cost: it's definitely not free.
[00:04:05] It's usually cheaper than an employee, but still, the costs are up there. And then safety: what if you have an LLM run a destructive command in production? So it's definitely early days with regard to these agents, and I'm not quite sure where they're going to fit in yet, but companies are definitely working on this to try and figure it out. If you want to learn more, definitely check out this blog post.
[00:04:28] Next piece of content: "Partitioned table statistics". This is from cybertec-postgresql.com, and he's talking about the issue that statistics are not normally collected on partitioned tables.
[00:04:42] So if you have a partitioned table with multiple partitions, the statistics for each individual partition do get collected, because each partition has data in it and autovacuum runs against it. But the parent partitioned table itself has no data in it; it's basically a logical construct, similar to a view. Because it has no data, autovacuum doesn't run against it and its statistics are not kept up to date. So this is basically a gap in Postgres right now. He shows an example where he creates a partitioned table called tab and then creates two partitions, tab_even and tab_odd. If you wait for autovacuum to run, it will vacuum and analyze, but only the individual partitions, not the parent partitioned table. So if you try to look at the stats for that parent table, there aren't any: zero rows.
[00:05:34] And this is from the documentation: partitioned tables, the parents, do not directly store tuples and consequently are not processed by autovacuum. That can cause suboptimal plans for queries that rely on the partitioned table's statistics. So basically you need to run ANALYZE on partitioned tables when they are first populated, and again whenever the distribution of data in the partitions changes significantly. I actually heard of a story where someone did a logical replication upgrade in which most of the tables they moved over were partitioned, and they analyzed all of the partitions but not the parent partitioned tables. When they went live they had some query slowdowns, but once they analyzed the parent partitioned tables, performance came back. That's why I always run a full-database ANALYZE on a new database when I'm doing a logical replication upgrade. He then shows an example where, even though the statistics aren't there, the estimate for a simple query is still pretty accurate. But next he looks at an example with a join where the estimates are wildly off: Postgres expects to find 500,000 rows when there are only 100, so it's off by a factor of 5,000.
[00:06:44] But once you analyze that parent partitioned table and run the query again, it estimates 100 rows and it delivers 100 rows. So that fixes the inaccurate statistics.
[00:06:56] So basically the recommendation is to run ANALYZE on these partitioned tables on some sort of periodic basis. Maybe it's once a week, maybe it's once a day.
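As a minimal sketch of that recommendation, using the table names from his example (tab is the parent, tab_even and tab_odd are the partitions), a periodic job could run something like this:

-- ANALYZE on the parent collects statistics for the partitioned table itself
-- (and also recurses into its partitions):
ANALYZE tab;

-- or, after something like a logical replication upgrade, analyze everything:
ANALYZE;

How you schedule that, via cron, pg_cron or whatever job runner you already use, is up to you.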
He also follows up and asks: why are partitioned tables excluded from autovacuum? And he says, well, the answer is probably that no one has gotten around to implementing it yet, and he comes up with a few ideas if someone wanted to work on a patch for it. But check this out if you're interested.
Next piece of content: "PostgreSQL 18 Release Candidate 1 is released". This is from postgresql.org, so we are about three weeks from the general release of Postgres 18, which is definitely exciting. If you want to upgrade soon, definitely download and check out the release candidate.
[00:07:40] Next piece of content: "Primary keys and sharded databases". This is from pgdog.dev. PgDog is basically a pooler that does sharding, essentially trying to offer turnkey sharding, and primary keys that use a sequence can be a problem there, because each sequence is created independently in each database. So how do you avoid sequence collisions when you're sharding out to multiple independent database systems?
[00:08:06] He says, well, you could move to UUIDs, but that requires schema changes, which kind of defeats the purpose of a turnkey solution, and if you're not using UUID version 7, which is time-ordered, you're going to run into performance problems. He says, well, you could use client-side ID generation, but that brings its own set of problems, because now you may have a complex single point of failure in your architecture, this ID-generation service, unless your application servers are doing it independently.
[00:08:37] But the solution he came up with is sharded sequences.
[00:08:43] His sharding solution uses Postgres's partitioning algorithms for sharding: the same way partitioning targets data to a particular partition, he uses that logic, but shards the data across database systems. This is the information used by the function behind hash-based partitioning in Postgres: the identifier of the partitioned table, the total number of partitions, the partition number, and the value to be inserted into the partition. Well, he essentially needs the same things to do his sharding. For this purpose, PgDog creates a schema in each Postgres database it manages. So there's a pgdog schema with a config table that stores values such as the shard number and the total number of shards, a few validator tables, and functions to get the next ID or UUID. What they want is for each sequence value that's generated to be appropriate for the shard it's running on. Does it pass the satisfies_hash_partition check? In other words, given how the data is sharded, can this value go into the database system it was generated on? If the generated value isn't valid for the database system it's running on, it moves on to the next sequence value, and then the next one, until it finds one that satisfies the hashing algorithm. Basically, he's brute-forcing the sequence. When a new row comes in and needs to be inserted, it needs an ID, so it gets the next value of the sequence and keeps trying until it satisfies the hash partition algorithm, and at that point it can insert the data.
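To get a sense of the idea, here's a rough sketch of that brute-force loop in plain SQL. This is not PgDog's actual code: the validator table, sequence and function names are hypothetical, and it assumes four shards with this database acting as shard 0.

-- A dummy hash-partitioned "validator" table gives satisfies_hash_partition()
-- a partition key to hash against; it never stores any rows.
CREATE TABLE pgdog_validator (id bigint) PARTITION BY HASH (id);

CREATE SEQUENCE sharded_id_seq;

-- Keep drawing sequence values until one hashes to this shard
-- (4 shards in total, and this database is shard 0).
CREATE FUNCTION next_sharded_id() RETURNS bigint LANGUAGE plpgsql AS $$
DECLARE
    candidate bigint;
BEGIN
    LOOP
        candidate := nextval('sharded_id_seq');
        EXIT WHEN satisfies_hash_partition('pgdog_validator'::regclass, 4, 0, candidate);
    END LOOP;
    RETURN candidate;
END;
$$;

Each shard runs the same loop with its own shard number, so no coordination between shards is needed.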
I would say this is a clever way to generate sequences on multiple independent shards without any shard having to know what the sequence value is on any other shard. Now, of course, the disadvantage is that you're having to call nextval multiple times to get a value appropriate for the shard it's running on. But he shows the performance difference. Calling nextval on a plain sequence already takes about 0.02 milliseconds. With one shard it takes a little over double that, still in the neighborhood of 0.04 milliseconds, and with two, three or six shards it's only up to about 0.04 milliseconds. He also tested 12, 24 and 48 shards, and even at 48 shards it's less than 0.08 milliseconds, which is about 80 microseconds. And this doesn't require any disk reading or writing, it doesn't require any network communication; it's all just happening locally in memory. You can also cache sequence values, say with a cache value of 1,000, so a session grabs a thousand values it can then use, and he saw a slight improvement, maybe 25% better performance, when caching sequence values that way. He does the same thing for UUIDs, generating them repeatedly until he finds one that satisfies the hash partitioning requirement, although I would think this might be slower than getting a new sequence value. But I thought this was an interesting way to handle primary keys with sequences when you're using multiple independent sharded database systems. If you want to learn more, definitely check this out.
Next piece of content: "PgDog adds support for Rust plugins". This is from pgdog.dev as well. The PgDog project now supports plugins written in Rust that you can plug right into the query router to change the sharding and load balancing algorithms. Now, that's not something I would touch with a ten-foot pole, but if you have specific use cases this could be very advantageous, and it definitely speaks to the open nature of the project. So if you want to learn more, definitely check it out.
Next piece of content: there was another episode of Postgres FM last week. This one was on disks, and the first discussion was whether we should still be calling them disks. I mean, aren't they mostly chips on a circuit board now? But I guess you could just call it storage.
[00:12:43] They definitely emphasized how important speedy storage is to your database system, particularly if your database doesn't fit entirely in memory or in shared buffers: you'll of course need to hit storage to pull in the files you need, and fast storage also lets WAL generation keep up if you have a ton of inserts going on. They talked about the importance of turning on track_io_timing in order to get good reporting on storage performance, and they also had a fair amount of discussion comparing local disks, such as local NVMe SSDs, with network storage solutions, things like Elastic Block Store at AWS. The advantage with local storage is that you get super high IOPS and very low latency, but one disadvantage is that it's easy to run out of space, so growing storage can be more difficult. Network storage usually gives you features like snapshots, which are fantastic; you can detach storage from one instance and attach it to another; you can clone a given storage volume pretty easily; and you can auto-scale a volume, so increasing disk is really quite easy to do. The downsides are limits on IOPS, or having to pay through the nose to get them, as well as increased latency compared to local storage. But if you want to hear their whole discussion, you can listen to the episode or watch the YouTube video.
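If you haven't turned that setting on already, a minimal sketch of doing so, and of where the numbers then show up, looks like this:

-- enable timing of data file reads and writes (modest overhead on most systems):
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();

-- with it on, EXPLAIN (ANALYZE, BUFFERS) reports per-node I/O timings,
-- and pg_stat_database accumulates read and write times per database:
SELECT datname, blk_read_time, blk_write_time FROM pg_stat_database;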
[00:14:13] Next piece of content: "pg_tde can now encrypt your WAL on prod". This is from percona.community, and they recently released a new version of pg_tde, which stands for transparent data encryption, that can now encrypt your WAL records as well, because that's something it didn't do before. Now, this is an extension, but you still need to use Percona Server for PostgreSQL, because Postgres does have to be patched to support this type of encryption. They say their long-term goal is to ensure the extension works with community Postgres, but until the patches they've developed make it into Postgres, you do have to use a patched Postgres if you want this particular feature. The tool also still has some known limitations. First, only certain backup and restore mechanisms work: for example, it works with pgBackRest, and pg_basebackup can work depending on how you're using it, but Barman doesn't really work, and neither does native logical replication. But check this out if you want to learn more.
Next piece of content: "Who logged to system from multiple countries, in 2 hours?". This is from depesz.com, and if you want an example of query optimization, he took exactly this question and tried to figure out the best way to run the query, testing multiple different approaches to find the highest-performing one. So if you're interested in this type of optimization showcase, definitely check it out.
Next piece of content is an exploration of the CNPG kubectl plugin. This is from cybertec-postgresql.com, and they've done a number of posts on CloudNativePG. Now they're looking at using CloudNativePG's kubectl plugin to control your Postgres systems running under CloudNativePG. So if you're interested in that, you can definitely check out this blog post.
[00:16:07] Next piece of content: "Run Jepsen against CloudNativePG to see sync replication prevent data loss". This is from ardentperf.com, and it's an example of using Jepsen, which he says is a testing framework for distributed systems that verifies safety guarantees by subjecting clusters to different types of faults. With it, he demonstrated that without synchronous replication it is possible to lose data when you lose, for example, the primary, whereas enabling synchronous replication avoids that. So if you want to see how he set up the tool and ran these tests, definitely check out this blog post.
[00:16:42] And the last piece of content: there's a new Postgres podcast down under. You can see the number of different interviews that have already been posted, and if you want to check it out, you can go to the site pgdownunder.com.
And now it's time for my consulting corner. One of the upgrade projects we're doing uses AWS RDS blue/green deployments to upgrade a database system. Now, there's not really an easy way to fail back with that solution, so we basically came up with one. For this particular application we did want to stop all the traffic and all the connections before the switchover, so we go ahead and do that. Then, right before we trigger the switchover, we create replication slots on the green cluster, what will be the future cluster. Basically, we're just going to use standard native logical replication to keep the old cluster in sync with the new one after the switchover. So with those slots created, we trigger the switchover, and that's now the live database. We start traffic to the database again, and we're now on the new version.
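Roughly, the moving parts look something like the following sketch, which also covers the subscription step described next. The object names are hypothetical and the exact commands will depend on your setup.

-- On the green cluster, right before the switchover, while traffic is stopped
-- and nothing is changing:
CREATE PUBLICATION rollback_pub FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('rollback_slot', 'pgoutput');

-- After the switchover, on the old blue cluster, attach to that slot.
-- copy_data = false works because both clusters were identical at the
-- moment the slot was created:
CREATE SUBSCRIPTION rollback_sub
    CONNECTION 'host=green-endpoint dbname=app user=replication_user'
    PUBLICATION rollback_pub
    WITH (create_slot = false, slot_name = 'rollback_slot', copy_data = false);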
Then we can go to the old cluster and create a subscription against those replication slots and publications we created on the new cluster. And we don't have to copy over any data, because those slots were created when there were no connections and no data was changing. So now, if we ever need to fail back, and we hope not to, we can stop the traffic again and drop connections, make sure there's no replication lag, transfer the sequences, and then point the app at the original cluster and start the app again. At that point we have essentially failed back to the old version. So this is a way to still use blue/green deployments for an upgrade, with a failback path we devised in case we need it. We hope we won't, but it's there as an option. I hope someone finds that beneficial.
I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content discussed and sign up to receive weekly notifications of each episode. You can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.
