1.2 Million Messages Per Second | Scaling Postgres 390

Episode 390 November 02, 2025 00:17:32

Hosted by Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how far Postgres can scale with queue and pub/sub workloads, temporal joins, IPC:SyncRep and nested partitioning.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/390-1.2-million-messages-per-second/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] For probably the last year you've heard this adage of "just use Postgres" or "Postgres is enough".

[00:00:08] Basically, whenever you have a particular data need, whether it's full text search, a queuing system or a pub/sub system, a lot of times developers grab a specialized tool to meet each of those needs. But the reality is that Postgres can do each of these roles and can get you pretty far until you reach a certain scale.

[00:00:31] Another interesting example of this is the release of Ruby on Rails 8 just under a year ago. They released three libraries, all offering database-backed solutions for things that previously used Redis. For example, they released Solid Queue, which is basically a queuing system that runs on the database. They released Solid Cache, which is a caching solution that runs on the database as opposed to a memory store like Redis. And thirdly, they released Solid Cable, which handles their real-time websockets; it now uses the database as opposed to Redis. So we're going to talk about an interesting blog post this week that actually analyzes the performance of Postgres from a queuing perspective and a pub/sub perspective, and I think it's pretty interesting. Also be sure to stay till the end, where in my consulting corner I'll talk about a few things I've been working on. But I hope you, your friends, family and co-workers continue to do well.

[00:01:32] Our first piece of content is "Kafka is fast, I'll use Postgres". This is similar to a blog post released not too long ago that said "Redis is fast, I'll use Postgres" and compared the performance. Well, this is someone who is actually really knowledgeable about Kafka; Topic Partition is actually a blog about Kafka. What he decided to do is analyze how fast you could get Postgres working in a pub/sub capacity and a queuing capacity. And he does have the opinion that you should just use Postgres up to a certain scale, so he wanted to see where that scale was. Now, he does admit that he's not a Postgres expert, so he did the best he could setting all of this up, but I thought it was pretty interesting nonetheless.

[00:02:23] And he talks about two trends he's seeing right now. One is the small data movement, where people are realizing things are not that big when you can get a 128-core instance with 4 terabytes of RAM from AWS. The second trend is "just use Postgres for everything", and here he's talking about different dedicated tools whose jobs you can actually do within Postgres; in the gray highlighted area, he lists the features or extensions that could give you that capability.

[00:02:51] So first he examined Postgres as a pub/sub, which basically means one topic is published once but then fanned out to many different subscribers. And what he did is actually write his own library for doing this pub/sub assessment, because he wasn't familiar with a lot of the popular libraries that do it. So he described the library he set up to test it as well as his table setup. And he did use partitioning, so he did break up the data into partitions. He didn't use Postgres partitioning, he just did it manually, as it were; you can think of it as list partitioning, I suppose.

[00:03:31] Now, I do see that when he's doing the writes, he basically does an update for every write, which definitely could slow things down. And then here are the different reads that are happening, and even reads have to do updates to update the offset here. (A rough sketch of this kind of layout follows below.)
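
Just to make the shape of this concrete, a Postgres-backed pub/sub along these lines usually pairs an append-only messages table, manually split by a partition column, with a per-consumer offsets table that gets updated on every read. This is only a minimal sketch with made-up table and column names, not the schema from the blog post:

    -- Hypothetical sketch of a Postgres-backed pub/sub layout (illustrative names only).
    CREATE TABLE messages (
        partition_id int         NOT NULL,   -- manual partitioning column
        offset_id    bigint      GENERATED ALWAYS AS IDENTITY,
        payload      bytea       NOT NULL,
        created_at   timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (partition_id, offset_id)
    );

    CREATE TABLE consumer_offsets (
        consumer_group text   NOT NULL,
        partition_id   int    NOT NULL,
        last_offset    bigint NOT NULL DEFAULT 0,  -- advanced by an UPDATE on every read
        PRIMARY KEY (consumer_group, partition_id)
    );

    -- A read grabs everything after the last committed offset, then advances it:
    --   SELECT offset_id, payload FROM messages
    --    WHERE partition_id = $1 AND offset_id > $2 ORDER BY offset_id LIMIT $3;
    --   UPDATE consumer_offsets SET last_offset = $4
    --    WHERE consumer_group = $5 AND partition_id = $1;

The key point to notice is that every read also performs a write (the offset update), which is part of why the read and write rates interact the way they do in the results below.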
And then we get down into the results. He looked at a 4 CPU single node and was able to get a write message rate of around 5,000 messages per second, a write throughput of 4.8 megabytes per second, and a read message rate of 25,000 messages per second, or about 24.6 megabytes per second. And he said, quote, these results are pretty good. It's funny to think that the majority of people run a complex distributed system like Kafka for similar workloads.

[00:04:17] Then he took a look at a tri-node setup. It looks like he is using one sync replica and one potential replica, which I assume means it's asynchronous. And his assessment of these results was, quote, now these are astonishing results. Throughput was not impacted at all. Latency increased, but not extremely. So basically having this synchronous replica didn't really have a negative impact; you get essentially the same performance. And he said it's pretty impressive that a 3-node Postgres cluster can pretty easily sustain a very common Kafka workload, five megabytes per second of ingest and 25 megabytes per second of egress, for just $11,500 per year, whereas a Kafka vendor will cost you at least $50,000 a year. And he said it's actually even more than that, because typically you use compression when working with Kafka, which typically quadruples your throughput. So this Postgres setup may be able to handle 20 megabytes per second of ingress and 100 megabytes per second of egress.

[00:05:22] And then he also tried a 96 vCPU instance, and the write message rate was 240,000 messages per second, with 1.2 million messages per second being read, which is incredibly impressive. But what was even more interesting is that the server was only about 10% utilized. He said the bottleneck was the write rate per partition, so it wasn't necessarily the machine holding it back, and it could be even higher than it was. So he says these results are pretty good as well, but the 96 CPUs were way overkill for achieving these types of results, because again the CPU was only at 10%, so he could have had a much smaller machine and still performed similarly, apparently. Then he looked at using Postgres as a queuing system. And even though he says Postgres has the popular PGMQ library, he still decided to write his own queuing system: insert a job, use SELECT ... FOR UPDATE SKIP LOCKED to lock a row and get a job, process the job, and then mark it as done by doing an update.

[00:06:25] He had his queue table here and a queue archive here, very simple writes, and then this is how he did the read: select the row, delete it from the queue, and then insert it into the archive. (A rough sketch of this pattern follows below.) Now, here he found Postgres wasn't good at handling the client count. So the bottleneck in this setup was the read clients, and this is from just a single node machine. He was using 15 reader clients and tried to boost that up, but each client couldn't read more than 192 messages per second, and the server stayed at around 60% CPU. The tri-node setup was pretty similar. And when he tried the 96 virtual CPU instance, he did get almost tenfold the message rate, but he still wasn't that impressed with the performance because he was seeing a bottleneck in the single table approach. But going back to the original point of this blog post: how many apps are going to need more than 20,000 messages per second? Not to say they don't exist, they exist. It's just that at that point you could scale to use Kafka or some other solution.
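
For reference, the classic Postgres queue shape he's describing, claim a job with SELECT ... FOR UPDATE SKIP LOCKED and move it to an archive when it's done, looks roughly like this. This is a generic sketch of the pattern with made-up table names, not the code from the blog post:

    -- Generic SKIP LOCKED queue sketch (illustrative names, not the blog's schema).
    CREATE TABLE queue (
        id         bigint      GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        payload    jsonb       NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE TABLE queue_archive (
        id          bigint      PRIMARY KEY,
        payload     jsonb       NOT NULL,
        created_at  timestamptz NOT NULL,
        archived_at timestamptz NOT NULL DEFAULT now()
    );

    -- One consumer transaction: claim a job (skipping rows other workers hold),
    -- remove it from the queue, and file it in the archive.
    WITH job AS (
        SELECT id, payload, created_at
        FROM queue
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    ), removed AS (
        DELETE FROM queue q USING job WHERE q.id = job.id
    )
    INSERT INTO queue_archive (id, payload, created_at)
    SELECT id, payload, created_at FROM job;

In practice the actual job processing happens between claiming and archiving, and whether you archive, flip a status column, or simply delete is a per-workload choice.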
And in terms of answering the question, should you use Postgres, he says, quote, most of the time, yes; you should always default to Postgres until the constraints prove you wrong. And he talks a lot about reasons to back that up. He's proposing using a minimal viable infrastructure, basically choosing good-enough technology, and when the pressures mount, then you can go ahead and scale. Because, he says, number one, Postgres does scale. For example, as of 2025 OpenAI was still using a single writer and multiple readers for their Postgres database, and they were, quote, the fastest startup to reach 100 million users. And secondly, you basically have more time to scale than you think, because when you factor in annual growth and how many years it takes to scale 10x, even at 200% annual growth, which is absolutely insane, it still takes you about two years. So even if you choose a solution such as Postgres, you do have time to switch to a more performant one when needed. Anyway, I thought this blog post was impressive and fascinating and I definitely encourage you to check it out.

[00:08:37] Next piece of content, Temporal joins. This is from crunchydata.com, and by temporal joins what he means is joining to another table and just grabbing one or a few rows from it. Specifically, this example is looking at the nth row of a second table, and I guess they call it temporal because it is ordering that second table's rows by a particular date to pull over the data. Now, Postgres has DISTINCT ON to do this pretty easily, particularly if you're choosing the first or the last row of something. However, if you need to get a specific nth row, he has a very simple way to do it with queries without using DISTINCT ON. So basically he's using a CTE over the second table.

[00:09:26] He is using a window function, row_number(), over a window partitioned by the user ID and ordered by created_at descending, ID descending. So he's basically using the date and then the ID as a secondary ordering, and then he's choosing where the row number, which is this row here, equals two. So he's looking for the second ordered instance of that data per user, and he simply joins it to the users table. So that's basically a quick and simple way to do what he calls temporal joins.

[00:09:56] Next piece of content, explaining IPC:SyncRep: Postgres sync replication is not actually sync replication. This is from ardentperf.com, and it basically says, you know, synchronous replication is not actually synchronous because, quote, the client issues a commit and then it pauses until it knows that transaction is replicated; then it can continue. But the reality is that the primary writer database doesn't have to wait for the replicas to catch up. It can still do all sorts of things such as vacuums or copy commands, and it can actually generate tons of WAL while it's waiting for this one transaction to be confirmed. Now, the benefit of this, of course, is throughput and performance, but it means that workload is actually a bit asynchronous. And he says this can cause a problem if all this other work is happening as the replication stream begins to lag, because the amount of disk needed for WAL actually starts to increase.

[00:10:53] And for these reasons, this is also why it is not advisable to set autovacuum_vacuum_cost_delay to zero, because it unleashes vacuum running at full speed, generating massive and bursty amounts of WAL.
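
As a quick refresher on what's being described here: synchronous replication on the primary is controlled by synchronous_standby_names and synchronous_commit, and the time a backend spends waiting for standby confirmation shows up as the IPC:SyncRep wait event in pg_stat_activity. A minimal sketch, assuming a standby that registered with the placeholder application_name 'standby1':

    -- postgresql.conf on the primary (standby name is a placeholder):
    --   synchronous_standby_names = 'FIRST 1 (standby1)'
    --   synchronous_commit = on

    -- Backends stuck at COMMIT waiting for standby confirmation show up like this:
    SELECT pid, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE wait_event_type = 'IPC' AND wait_event = 'SyncRep';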
A lot of this exhibits itself as IPC:SyncRep wait events. And he says if you're seeing this, you should basically analyze your WAL activity to see if something like this might be happening to you.

[00:11:22] So if you want to learn more, definitely check out this blog post.

[00:11:25] Next piece of content, Troubleshooting PostgreSQL logical replication: working with LSNs. This is from percona.com, and this is going over an example where you have a publisher and a subscriber. He says there are all sorts of different system views you can use to assess the state of the logical replication on the publisher side and a few that you can use on the subscriber side. Each of the subscriptions that has been set up to read from a publisher creates a replication slot, and you can query the pg_replication_slots view to check the slot state, for example. Each of these slots maintains a log sequence number to know what's been transmitted to the subscriber. Now, he actually wanted to throw a monkey wrench into the logical replication process.

[00:12:11] So he was replicating this table, and he actually added a unique constraint to the subscriber but not the publisher. He then created some data on the publisher that would cause a conflict, and basically the logical replication encountered this conflict and stopped replication until it could be resolved. And you can see an error here, duplicate key value violates unique constraint, with details saying the key already exists. So basically this needs to be resolved before you can move forward.

[00:12:42] Now, one way to resolve it, of course, is to remove the unique constraint on the subscriber and have it catch up normally. But let's say you can't do that. Well, you can skip over changes. So there might be a scenario where you consider this duplicate data and you want to skip over it. He's using two functions to do that: pg_logical_slot_peek_binary_changes and pg_replication_slot_advance. (A rough sketch of these two calls follows at the end of this segment.) So he uses the first query here to identify the records that will need to be skipped over, along with the transaction ID and LSN, and then you can use that LSN to advance the slot and essentially skip over the duplicate data. But of course doing things like this can get you in more trouble than you expect, so you definitely want to do what is suggested here and read the logs on both sides to assess the root cause of the issue.

[00:13:39] But if you want to learn more about this, definitely check out this blog post. Next piece of content, Transaction pooling for Postgres with PgCat. This is from enterprisedb.com, and I think it's safe to say PgBouncer is the most popular pooler available for Postgres. However, PgCat is an interesting option as well, and he went ahead and set it up, actually testing a single backend connection and showing two different clients connecting to the pooler, to see what the effects were and basically to understand how PgCat works from an engineering perspective. Then he added in two backend connections with a pool size of two and observed the behavior from that perspective. So this is a pretty long blog post, but it definitely gives you insight into how PgCat works, to see if you might want to consider it as a pooler as well.
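
For what it's worth, the two functions mentioned for the logical replication case are standard Postgres functions: pg_logical_slot_peek_binary_changes() inspects pending changes on a slot without consuming them, and pg_replication_slot_advance() moves the slot's position forward past a given LSN. Here is a hedged sketch of what those calls can look like; the slot name, publication name and LSN are placeholders, and the exact options depend on the output plugin the slot uses:

    -- Peek at pending changes on a slot without consuming them
    -- (slot and publication names are placeholders; options vary by output plugin).
    SELECT lsn, xid
    FROM pg_logical_slot_peek_binary_changes(
             'my_subscription_slot', NULL, NULL,
             'proto_version', '1',
             'publication_names', 'my_publication');

    -- Once the offending transaction's LSN is known, advance the slot past it.
    -- This permanently discards those changes, so only do it for data you are
    -- certain is safe to skip.
    SELECT pg_replication_slot_advance('my_subscription_slot', '0/16B3748'::pg_lsn);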
[00:14:29] Next piece of content, Returning multiple rows with Postgres extensions. This is from pgedge.com, and this is another example of creating your own extension. He's actually creating a meminfo extension that's basically going to read the /proc file system to give you information about memory, so you can get this information right in Postgres. It goes through the whole process of writing the extension, so if you're interested in that, you can check out this blog post.

[00:14:57] Next piece of content, Waiting for PostgreSQL 19: add psql prompt variable for search_path. So basically the search_path can now be shown as part of the psql prompt, and there is a new option, %s, that you can add into your prompt to be able to see it. So this hopefully will get into Postgres 19. The next blog post, related to that, is PostgreSQL 19: two nice little improvements, log_autoanalyze_min_duration and search_path in the psql prompt. We've already mentioned the second one, but for the first one, there already exists a configuration variable, log_autovacuum_min_duration, that logs details about an autovacuum run. Well, this does the same thing for analyze. So he set it to a small duration of one millisecond, and when analyze was run on this table, you can see the output of it. So now we can get logged information for vacuums as well as analyzes. And the last piece of content is actually a Postgres feature diff site, featuredif.com, where you put your current version on the left side here and whatever version you're going to here, and it shows all the different features that have been added to Postgres in various categories. So that looks like a pretty neat tool to use.

[00:16:18] And now it's time for my consulting corner. So, just an update on some things I'm working on.

[00:16:24] I'm looking into setting up a partitioning structure in a tiered fashion, meaning not just one tier of partitions, but actually two. Now, I've actually done this historically for my app, where I've partitioned essentially by a hash of the tenant ID and then by a timestamp. But here I'm actually looking into doing list partitioning and then date range partitions, or it might be date range partitions and then list partitioning; I'm still trying to decide based on how this would be queried most often. So I actually have a question for you: do you have any experience with multiple tiers of partitioning? And do you like to always go with timestamp first, or maybe timestamp second? Frankly, in my experience it's based upon the use case, of course, but let me know in the comments.

[00:17:14] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can find an audio version of the show as well as a full transcript. Thanks, and I'll see you next week.
