Postmaster Does Not Scale! | Scaling Postgres 404

Episode 404 February 15, 2026 00:17:21

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how the postmaster does not scale, new Postgres releases, data storage using PAX, and a change to the planner's path generation strategies.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/404-postmaster-does-not-scale/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] So a couple of weeks ago we covered an OpenAI post where they're supporting 800 million users on one Postgres primary. And this week we have a story arguing that it's not just that a single primary doesn't scale, it's the postmaster process that doesn't scale. We're also going to cover the new Postgres releases, which address several security vulnerabilities, so be sure to pay attention to that. But I hope you, your friends, family and co-workers continue to do well.

Our first piece of content is "Postgres Postmaster Does Not Scale". This is from Recall AI, and they send meeting bots to take notes for people in online meetings. Their resource utilization looks like this: these are the number of EC2 instances in their fleet, and as you can see, around the start of the hour, though it can happen on the half hour as well, they have an explosion of EC2 instances that they deploy to join online meetings and take notes. The highest peak is around 30,000 EC2 instances. They've had to deal with bottlenecks at, they said, every level of their infrastructure, and they ran into a problem where they were seeing slow connections to Postgres.

[00:01:21] So these EC2 instances would come up, but they could not connect to Postgres, and sometimes the wait was as long as 10 seconds.

[00:01:29] Now, they tried diagnosing this in their hosted environment, but they just couldn't do it, so they actually had to set up their own dedicated EC2 instances, a non-hosted platform, in order to diagnose what was going wrong. They had a Redis coordinator that fired up 3,000 EC2 instances that contacted a Postgres database, and what they discovered was a problem with the postmaster. This runs as a single process, a single-threaded loop that processes events synchronously. It's responsible for forking itself to create new backends as new connections come in, [00:02:05] for reaping those backends when they exit, and for launching background workers for parallel processing. What they suspected was going on is that a burst of new connections would temporarily overwhelm the postmaster loop, causing it to lag behind the queue of incoming connections. Now, I should say they did look at the stats of the machine and it looked relatively normal, but the devil's in the details, because they were seeing 100% CPU utilization for the Postgres postmaster. That was where the burden was. And you can see from the graph, as they ramped up the connections per second, once they got to around 1,400 connections per second to the database, CPU utilization hit 100% for the core running the postmaster process.

[00:02:56] And when they looked at a flame graph using perf, they saw that most of the time was spent spawning and reaping backends. Now, they did say that if you configure huge pages you can mitigate some of this; they saw a 20% throughput increase by using huge pages. So that's something to do if you suspect you might be having this issue. But what they also suspected was going on is parallel workers being kicked off. They say a high background worker churn rate also puts pressure on the postmaster main loop. Now, again, in the original graph we showed at the top, the problems they were experiencing weren't happening every hour, it was still a bit sporadic. So what they think was actually happening is that parallel queries were a confounding factor. When they bring up all these EC2 instances, creating lots of connections, if they also had a lot of parallel queries going on at the same time, that would cause saturation of the postmaster process.

[00:03:56] And during one of the incidents, they did see a spike in background worker shutdowns. They said they were able to simulate high background worker churn in parallel with a connection flood and observed a large decrease in connection throughput from the postmaster. So to resolve the issue, what they're planning to do, or have already done, is first, implementing jitter in the fleet of EC2 instances so that not all of them come up at precisely the same time; there's a little bit of variability. And secondly, eliminating bursts of parallel queries from their API servers. He's basically saying this is probably another reason why everyone runs a connection pooler, and that the postmaster may be one of the main reasons, or at least a significant reason, why Postgres connections are slow. Apparently they did a rudimentary experiment showing that they could linearly increase connection throughput by adding additional postmasters on the same host. And his final comment is that, as an aside, it's mildly absurd that none of the database-as-a-service providers or monitoring tools have observability into postmaster contention. But I thought this was a great post. Definitely encourage you to check it out.
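By the way, if you want to experiment with the mitigations mentioned here, the relevant knobs look something like this. This is just my own minimal sketch, not anything from the Recall AI post, and the values are purely illustrative; huge pages also have to be configured at the operating system level.

-- Check whether huge pages are in use, and rein in parallel worker churn for bursty workloads.
SHOW huge_pages;                                        -- 'try' by default
ALTER SYSTEM SET huge_pages = 'on';                     -- needs OS hugepage setup and a Postgres restart
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;   -- stop launching parallel workers per query
SELECT pg_reload_conf();                                -- applies the parallel setting; huge_pages only changes after a restart

Setting max_parallel_workers_per_gather to zero is a blunt instrument, of course; you could instead scope it to just the bursty role or session rather than the whole cluster.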
[00:05:08] Next piece of content: "PostgreSQL 18.2, 17.8, 16.12, 15.16 and 14.21 Released". So this actually addresses five security vulnerabilities. Now, most of the ones with the higher base score of 8.8 are extension-related, but if you're using those extensions this is still important. There's a vulnerability in the intarray extension, an 8.8, basically input validation issues. There's also an issue with the pgcrypto extension with regard to a heap buffer overflow, again an 8.8, and an issue with the pg_trgm extension, also a heap buffer overflow. But there is another 8.8 that is not related to an extension: missing validation of multibyte character length can execute arbitrary code, and that code runs as the operating system user running the database. It's triggered by text manipulation by a database user, so this is something that should definitely be patched as soon as you're able to.

[00:06:14] And of course there are a host of other bug fixes present, but as you should always do, check out the updating section. They do say you need to reindex any indexes that are on ltree columns, due to changes that were made, if you don't use the libc collation provider. So if you want more details about this, definitely check out this blog post.
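To illustrate that reindexing step, assuming you have an index on an ltree column, the post-update command would look something like this. The table and index names here are made up for the example.

-- Hypothetical names; reindex any index built on an ltree column after updating.
REINDEX INDEX CONCURRENTLY categories_path_idx;
-- Or, if you are not sure which indexes on a table are affected, reindex the whole table:
REINDEX TABLE CONCURRENTLY categories;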
[00:06:38] Next piece of content: "PAX: the cache performance you are looking for". This is from mydbanotebook.org, and they're considering the situation where you have a query like this, selecting a single column from a table where that value is higher than 30. Now, this is an arbitrary example, because you normally don't use one column in both the SELECT and the WHERE, but it gives you a hint about what the post is about. It's basically about columnar storage of data.

[00:07:09] Right now Postgres uses a technique called NSM, or the N-ary Storage Model. So for each 8 kilobyte page of Postgres, you have line pointers at the front that point to the data at the end of the page containing each tuple, or each row of data, with the free space in the middle. As new data gets added, the free space gets reduced.

[00:07:34] With a query like this, where you're only looking for a single column, you're going to have to touch each row and throw away a fair amount of bytes on either side of where the age is. Because, as she says, quote, "Modern CPUs don't fetch individual bytes. They load cache lines, typically 64 bytes at a time", although sometimes it's up to 128. So essentially this is just dead information that's not needed, and it's very cache-inefficient. As she puts it, quote, "we wanted age, we got everything."

[00:08:08] That's 94% cache pollution.

[00:08:12] So if you're pulling a million rows, you would get 56 megabytes of wasted bandwidth and only 4 megabytes that is useful. But she says there's another way to lay out the data, and that is PAX, which I didn't see what it stands for, but it's basically doing column storage within each page.

[00:08:32] So for all the rows contained within a page, there's a PAX header that references each column's data. You essentially have an array of the IDs, an array of the names, an array of the ages, an array of the cities and an array of the salaries. So when you're pulling back more than one record and you're only looking for the age, a lot of that can fit into that 64-byte cache line, which she shows down here. And the great thing about this with regard to updating, because you may have heard how column storage is really hard to update, is that with the original PAX concept, which was invented in 2001, it seems you can update the page just using MVCC.

[00:09:17] And she looks at some differences between PAX, what Postgres needs, and how Parquet files relate to that. Now, if you're thinking through this, you're probably wondering: what about when I want the whole record, the whole row? Well, PAX is less efficient at that. So she has another chart that asks whether it's good for OLTP. For NSM, what we have today, it's a definite yes. For PAX, it's pretty good, so you're going to lose some performance with it; she theorizes maybe a 5% cost. But when you're talking about OLAP, NSM is not really great, whereas PAX becomes pretty good. And she does have a section on where PAX falls flat. She says when you have narrow tables of fewer than eight columns, it's not really worth it; if you have a lot of "select every column" queries, it's probably going to have more overhead than benefit; and the same goes for heavy random access via indexes, where the index lookup reads the full tuple. But I look at this and I think it would be great if someone were able to construct a storage engine using the PAX technique. That way you could store, say, a subset of your tables where you know the query patterns, optimized for a more columnar data layout. That could be very interesting, but check this out if you want to learn more.

Next piece of content: "Do PostgreSQL Subtransactions Hurt Performance?" This is from shaneborden.com, and he's talking about the issues that subtransactions can cause.

[00:10:52] So he says you can call nested procedures relatively freely; no new transaction IDs are used, they basically use the parent transaction. You can use BEGIN...END blocks as long as there are no exceptions in them. That's great. Now, commits are expensive because they do burn a transaction ID, but of course you must commit. The problem is exceptions, particularly if you have nested exceptions or nested blocks or anything like that, because they create subtransactions, which become a performance killer.

[00:11:27] Now, he has an example script that he built to look at some different scenarios, including subtransaction loops at depths of 48 and 128, because there is local memory for 64 subtransactions, but if you exceed that, performance craters, and he shows some of the timings. So basically, if you use exceptions there is a risk, because those subtransactions are generated, and it does generate WAL to keep track of them. But what's really bad is the nesting of subtransactions, particularly at deeper levels, especially if you do something recursive with procedures that have exception handling within them. And he has a scenario here: if you need to handle errors in a bulk load, say you need to insert 10,000 rows and skip the ones that fail, well, you can use COPY for that. But if not, if you're using procedures to do it, first, validate the data in the application ahead of time so the database doesn't even see the bad rows. Secondly, don't wrap every insert in an exception block. [00:12:30] Third, use exception logic purposefully and avoid the need for a catch-all like WHEN OTHERS. And use INSERT ... ON CONFLICT DO NOTHING, because again, that avoids the exception handling. But check this out if you want to learn more.
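To make that concrete, here's a small sketch of the two patterns, using a hypothetical items table. This is my own illustration, not the example script from the post. The first version wraps every insert in an exception handler, so each iteration opens a subtransaction; the second uses ON CONFLICT DO NOTHING and avoids subtransactions entirely.

-- Pattern to avoid: one subtransaction per row (hypothetical items table with a unique id).
DO $$
BEGIN
  FOR i IN 1..10000 LOOP
    BEGIN
      INSERT INTO items (id) VALUES (i);
    EXCEPTION WHEN unique_violation THEN
      NULL;  -- skip the failed row, but a subtransaction was still created for it
    END;
  END LOOP;
END $$;

-- Preferred pattern: no exception handling, no subtransactions.
INSERT INTO items (id)
SELECT i FROM generate_series(1, 10000) AS s(i)
ON CONFLICT (id) DO NOTHING;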
Next piece of content: "Waiting for Postgres 19: Better Planner Hints with Path Generation Strategies". This is from pganalyze.com. So there's been a path generation strategies change, and a proposal for a pg_plan_advice extension.

[00:13:01] There has been an extension called pg_hint_plan, where you can give hints to the planner, but there haven't been really good hooks in Postgres for it. Lukas mentioned in his video that it actually carries copied Postgres code to handle some of this, but with this change it looks like they're building better hooks to support it, and the pg_hint_plan extension could potentially lose 2,500 lines of code because of it. And you may be wondering, what is pg_plan_advice?

[00:13:30] Well, Lukas says the way he understands it is that, quote, "the goal is that plan advice for a given query plan can emit a string of advice, and if you pass that string back to Postgres, you get the same plan."

[00:13:45] So it's basically a way to get repeatable plans, or prevent plan changes on the fly, and that would be great to have in Postgres. I believe that AWS Aurora or RDS does support plan locking of some sort, so being able to prevent plan switching could be a great enhancement if that's possible. But if you want to know more about the details, definitely check out the blog post and the YouTube video.
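For context, a pg_hint_plan hint today looks roughly like this: a special comment at the front of the query. The table and index names below are made up, this isn't an example from the pganalyze post, and pg_plan_advice's eventual interface may well look different.

-- Hints go in a /*+ ... */ comment at the start of the statement (pg_hint_plan must be installed and loaded).
LOAD 'pg_hint_plan';
/*+ IndexScan(orders orders_customer_id_idx) */
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;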
[00:14:13] Next piece of content: "Hydra Joins Supabase". This is from supabase.com, and the co-creator of Hydra, Joe, has joined the Supabase team; Hydra co-developed pg_duckdb.

[00:14:27] So basically it looks like this is an acquisition to round out the analytics capabilities of the Supabase platform, but they're calling it an open warehouse architecture, and they do say that everything will stay open source and that they'll help MotherDuck with pg_duckdb contributions.

[00:14:46] So definitely check this out if you're interested.

Next piece of content: "Reading Buffer Statistics and Explain Output". This is from boringsql.com, and this was a great post explaining that when you run EXPLAIN (ANALYZE, BUFFERS) on a query, you get the buffer output shown here. When the query runs, it shows how many buffer hits you have, how many reads you have, and it also shows the buffers that were used during planning. The blog post goes through and explains what each of these means.

[00:15:17] So again, a great resource; if you want to learn more, definitely check out this blog post.

[00:15:22] Next piece of content: "eBPF tracing of PostgreSQL spinlocks". This is from, well, I'm struggling to read the URL here, so I'm just going to say this is from Jan's website and blog. Now, we don't normally talk about spinlocks because they are more for internal management of Postgres structures.

[00:15:41] As he says, quote, "spinlocks are intended for very short term protection of shared structures". So this isn't row locking or table locking or anything like that; it's for the management of the Postgres engine itself. And this article explains what spinlocks are, how they're implemented in Postgres, and introduces a new tool called PG Spinlock Tracer that lets you understand what's going on with them. So check this out if you're interested.

Next piece of content: "Recreating PlanetScale's PG Strict in Rust: A Build Log". This is from saybackend.com. PG Strict is an extension, and the one developed by PlanetScale is not open source, but the author of this blog post wrote an open source one. What it does is require a WHERE clause when you're running UPDATE or DELETE statements, because you're updating the whole table if you forget that WHERE, or deleting the entire table if you forget it. The post goes through how he designed it and how he wrote it in Rust. So if you're interested in that, definitely check out this blog post.

And the last piece of content: there's a new website called postgreslocksexplain.com, and it's a pretty interesting education on how locks work in Postgres. If you're interested in this very interactive way of learning about Postgres locks, definitely check it out.

I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content discussed, as well as sign up to receive weekly notifications of each episode. You can also find an audio version of the show as well as a full transcript. Thanks. I'll see you next week.
