Postgres 19 Beta 1 Released | Scaling Postgres 420

[00:00:00] Over the last number of months we've talked about all the different features that should be coming in Postgres 19. [00:00:07] Well, today we have an idea of what might actually land in Postgres 19. I mean things can always be taken away. [00:00:15] But the beta was released today, so we'll see what's in that. But I hope you, your friends, family and co workers continue to do well. [00:00:25] Our first piece of content is PostgreSQL 19. Beta 1 is released and we'll just take a look at the feature highlights. So in the performance area, they're talking about changes to the Asyncio system where it now automatically handles workers. You just give it a min and max and it handles it. They're talking about the new PGPlanNadvice extension that's being released. The fact that you can now do auto vacuuming in parallel. They mentioned the new repack command and repack concurrently to compress your tables and I always see things that I haven't heard of before and I want to know more information. [00:01:03] But it says PostgreSQL 19 shows up to two times better performance on inserts when foreign key checks are present. [00:01:11] So I don't recall any blog post discussing this feature and I actually want to learn more about this, but I haven't looked it up yet. [00:01:19] And various other performance benefits. With regard to the planner, they talk about developer experience. With regard to introducing SQL pgq, which is for doing graph queries, we'll see two blog posts discussing that in this show they mentioned the merge and splitting of partitions. [00:01:38] You can now return rows from an on conflict do select but of course they have many more features. They've enhanced security, added more monitoring observability, and of course continuing to every version improve the logical replication capabilities. [00:01:55] So definitely check this out if you want to learn more. [00:01:58] Next piece of content There was another episode of Postgres FM last week. This one was on Autovacuum and basically they started off by Saying AutoVacuum has one primary job, but then a few other jobs as well. So its main job of course is garbage collections. So with Postgres mvcc, whenever you do an update or delete, that old row is retained and then must be cleaned up via vacuum. So that's essentially garbage collection. It is also responsible for keeping statistics up to date, so it updates those stats so the optimizer can develop an accurate plan for running queries. It also handles TXID wraparound by freezing pages as well as updating the visibility map. So the visibility map is important to optimize index only scans because visibility information is not in the indexes. So it uses the visibility map to determine what's visible and if something is not, it must access the heap. Now, in terms of optimizing Autovacuum, the first thing that Nick mentioned is increasing the number of workers. [00:03:04] And I actually don't do that as a first step when I'm optimizing Autovacuum, the very first thing I do, which he did mention, is increasing the cost limit. Because the cost limit I think by default is 200, but I set that up to thousands. 2000, 3000, 4000. It depends on how active and how large the system is, because that cost limit dictates essentially how fast Autovacuum can process through things. And I only increase the number of workers when it's obvious the three default ones are always busy. So in that scenario, definitely need to increase the amount. I think Nick was mentioning a general guidance. Have as many Autovacuum workers as a quarter of the cores of the system. So if you have a 32 core system, maybe you want eight workers. [00:03:55] But definitely the next most important thing is adjusting the scale factor. So by default it's 20%. So when 20% of the table changes, go ahead and start vacuuming. [00:04:05] Well, really that should be closer to 2% or 1%. And I've even set it lower than 1% in certain cases where there are very large tables being used. [00:04:17] Because if you have the billion row table and the default settings, you essentially have 200 million dead tuples in the table. And they said something to be aware of with PostgreSQL is that if you set your maintenance work memory too high, you actually may risk having an out of memory eventually. Therefore they say be sure to set that Autovacuum work memory to maybe half a gigabyte when you're on Postgres 18 to potentially avoid that issue. But if you want to learn more, definitely listen to the episode or watch the YouTube video down here. [00:04:49] Next piece of content. Multigres version 0.1 Alpha, an operating system for Postgres. [00:04:56] This is from Supabase.com and they're announcing the release of an early alpha of Multigres. So again, this is Vitess for Postgres. Vitess is a scale out Charting solution for MySQL and the intent of Multigres is to holistically manage your Postgres instance, giving you sharding connection pooling, automatic failover and backup orchestration. So it essentially wants to do all the things and they mentioned they do have a multigres operator, so it's intended to be run on Kubernetes and they have an coordinator and operator to be able to run multigres clusters on Kubernetes. [00:05:34] So basically it addresses high availability, it does use native Postgres replication and it uses a consensus based system to handle the high availability. [00:05:45] You can have user defined durability policies and you can add or remove replicas without affecting performance. [00:05:52] And they link to a few blog posts about it. They've also built their own connection pooler that's a part of it. We discussed that I think two shows ago. They have a multi gateway that is the entrance into the multigres system and then a multi puller that sits on each host to handle the back end connections to the instance. [00:06:12] And this handles traffic routing context aware pooling. So you don't set whether it is running in transaction mode or session mode. It just essentially pin session like RDS proxy does. It has per user pools and it does prepared statement consolidation. It handles backups via PG backrest. So it's a good thing that that project is more healthy now. [00:06:35] And this is what alpha means. So they know they have issues that still need to be addressed. [00:06:41] Sharding, which is essentially the flagship feature, is not in this release. So this version does single shard cluster with high availability and pooling. [00:06:51] Future releases are not guaranteed to be backward compatible. So that's important to emphasize. The API is still not yet stable and performance benchmarks are in progress. [00:07:03] But if you're looking for a sharding solution, maybe it's time to evaluate this. [00:07:08] Related to this, there were a couple of blog posts that was put [email protected], the first one is building a truly compatible Postgres proxy. The multigres story. [00:07:19] So they're talking about the work that they did to make the proxy as much like Postgres as possible. [00:07:26] So they were using PGR for doing regression tests and tried to match it to be as much like Postgres as possible. And if you look at the pgregress suite with PgBouncer running in session mode, essentially all but one of the regression tests pass PgBouncer transaction mode. Of course, there's quite a few less at multigres at this point they're at 164 out of 222 regression tests. And then they talk about more details with regard to that. The next blog post again from multigres.com is deploying the multigres operator. So this is using the operator to manage multiple multigras clusters. So you can definitely find out more information about that here. [00:08:14] Next piece of content handling graphs with SQL PGQ in PostgreSQL. This is from cyber.postgresql.com and like was mentioned with the Postgres 19 beta release, they've introduced the ability to create graphs and to query them using graph table. [00:08:32] So this blog post works through an example of that. So he's saying, hey, let's have a social network where we have a person table and a nose table just matching who knows who. [00:08:43] So again we have the person table, which is essentially a vertices in graph speak. And then we have a nose table that basically links one person's ID to another. And this is essentially the edges object in graph speak he inserts the data and then he defines the property graph he's calling social. So you can think of it as a separate object. It is the property graph social. It includes these vertex tables which right now it's just the one table, the person table keyed on the ID and he's calling it person. I suppose you could have a different label and it includes these properties. So you explicitly state which columns should be included. Then we have the edge tables and with the label of nose. And it has additional properties like the vertex tables, which is sense and then it matches the source and the destination. [00:09:37] And this is what a graph query looks like. So select the name from the graph table social and match where P is a person and give me the columns of p name and order it by the name. So this is the output that you see. [00:09:53] So it's not really utilizing the graph, it's just basically doing a select name from person order by name. [00:09:59] But to actually do a graph query you would do a query like this and you're matching P is a person, is knows P2 is a person, include the columns and order by it. So now you can see the relationships between people. So Alice knows Bob and Alice knows Carol and you can do multiple hops. So you can say A is a person, is knows B is a person, is knows C is a person. But you'll notice in some cases it actually loop backs around to itself. So Alice knows Bob and Bob knows Alice. [00:10:34] So in order to filter that down to avoid those types of circular relationships, you can add a where clause where aid is not equal to the cid and you can use explain queries with this. So you can explain this GraphQL and you'll see what it's doing doing here. And it's really just rewriting this graph query into a SQL query that the planner can interpret and run. [00:10:58] So as he says, quote, there's no additional executor nodes or additional things we can see in the execution plan. Postgres simply rewrites the query behind the scenes to make the syntax easier. [00:11:08] So this isn't really building a graph database in Postgres, it's still using the heap. It's just a logical layer on top of your existing data that you can query in different ways, and all it's doing is rewriting the query and still running it against your existing relational data. Another blog post related to this is SQLPGQ in PostgreSQL 19 graph queries without the graph database. This is from thebuild.com and he explains the exact same thing. And then his important point is Postgres can work great for shallow and fixed depth queries like this, but if you have deep and variable length queries, you're probably going to have to reach for a graph database. Like for example, he mentions Neo4j to run really complex graphs. But if you have something relatively simple, you know, shallow, fixed depth, then maybe this feature will work for you. [00:12:07] But check these blog posts out if you want to learn more. Next piece of content when is a function leak Proof? This is from cyberpeck PostgreSQL and he's talking about a function that can essentially leak information that you don't want the user to have. So this is basically from the perspective of a security context and he was working with row level security and actually noticed an issue when looking at, for example explain where you can actually run a query. Say select all from the account where the owner equals duff and assume you are not duff. Well, Explain reports back the fact that it did an index condition, but you can see the rows removed by filter was one. So it's not that you can see the account row level security is working, but Explain has essentially leaked the fact that Duff does exist, so leakproof to your functions is trying to avoid this problem. [00:13:04] But of course the downside of doing that is worse performance, because it all depends on when different functions must be evaluated and whether indexes can be used or not. So there's definitely a trade off between performance and security. [00:13:20] But with regard to the Explain leak he says, quote one should not consider row level security as a protection against users that run arbitrary SQL statements against the database. [00:13:32] Rather, the safe way to use row level security is to use it in the context of an application where the application controls the SQL statements. [00:13:41] So don't think you can just give your users raw SQL access with row level security and everything will be fine. The type of SQL that runs should be also safeguarded, but if you want to learn more, definitely check out this blog post Next piece of content same query 3 results benchmarking paradedb and Postgres HTML fts or full text search this is from paradedb.com and they took a single query and ran it three different ways. [00:14:11] The first way they ran it was just running one query after another with the same search value and looked at the delta between paradedb again using elastic like search I think a BM25 index compared to a GIN index using Postgres full text search and the delta was about 10% different. So you would say, well paradedb's not that fast as one might expect. [00:14:39] Then they just changed the number of terms, but now it's searching up TO I think 40 different terms that varied highly in how selective they were in given documents. And that delta increased significantly between paradedB and full text search and of course paradedB came up on top over 3000 queries per second versus 100 for Postgres full text search. And then they took a look at parallel queries. So 50 queries per second and the delta increased even further to a 47x gap. And you can tell it's also enormously variable with the full text search as well, also using a lot more memory. [00:15:22] So definitely interesting set of benchmarks. And check this out if you want to learn more. [00:15:27] Next piece of content pgstatstatements everything it tells you this is from boringsql.com and of course this is arguably the most popular extension in Postgres and it emphasizes what it essentially stores is query identifiers and how often they occur. So it doesn't record the fact that you looked for a user whose ID equals 42. [00:15:53] It stores where the ID was equal to some variable. [00:15:56] So this is the query identifier and it records how many times that happened. And he also mentions the N list because before Postgres 18 each of these would be considered different queries, but with the enhancements in 18 now it's just considered one and it records how many calls happen. So that's definitely preferred compared to this. [00:16:19] But he covers a lot more in this blog post as well if you want to check it out. Next post related to that Also from boring SQL is PGStatStatements everything it can't so what it can't tell you anyfirst says, well it can only store a certain number of queries or statements. By default it's 5,000, but of course that can be adjusted the scenarios where the query text can vanish, particularly if the file grows large. [00:16:47] It doesn't record if you have failed queries, and its statistics are only per node. So you have to examine each node's pgstat statements to understand what query patterns are going on, and then goes over various configuration options. And this is the big reason why a lot of these monitoring platforms actually query pgstat statements on a regular basis to actually build that history and to give you what you can't get just running the PGStatStatements extension. But check these out if you want to learn more. Next piece of content related to PGStats statements is that panlyze.com has been doing a series about PGSTAT statements. So if you want to learn more about that, this is the latest in the series Postgres in production. Deep dive into PGStats statements, so feel free to check this out along with the companion video. [00:17:38] Next piece of content looking forward to PostgreSQL the new repack command. This is from pgedge.com and he talks about the command, but he makes a very important point here is about what he calls an ounce of prevention. So much like the Autovacuum podcast that Postgres FN mentioned, Optimizing Vacuum can help prevent you from having to run Repack as often, because if you optimize your Autovacuum settings, you're going to minimize bloat. [00:18:07] So he says definitely focus on that first. And he has his own set of guidelines that he follows when optimizing Vacuum, so you can check this out for that too. [00:18:18] And the last piece of content hacking workshop for June July 2026. This is from arhas.blogspot.com and so normally this is monthly, but it looks like they're doing one talk on additional I.O. observability and Postgres with PGSTAT I.O. by Melanie Plageman so you can use this form to sign up for that. I hope you enjoyed this episode. Be sure to check out scalingpostgres.com where you can find links to all the content discussed, as well as sign up to receive weekly notifications of each episode. There you can find an audio version of the show, as well as a full transcript. Thanks. I'll see you next week.

Show Notes

Episode Transcript

Other Episodes

Episode 296

State Of Postgres 2023 | Scaling Postgres 296

Episode 79

PgBouncer SCRAM, Art of PostgreSQL, Distributed Time-Series, Window Functions | Scaling Postgres 79

Episode 48

Microsoft Acquires Citus Data, Split WAL, Maintenance Work Mem | Scaling Postgres 48