Episode Transcript
[00:00:00] You don't normally think of Postgres as a great queuing solution because of the way it does MVCC: all of the rows that are created or updated will need to be vacuumed, and that can add a lot of overhead to a queuing system that is designed to process, in this case, hundreds of thousands of events. But in the first blog post today we're going to be talking about a company that actually chose Postgres for its queuing system. And be sure to stay till the end, where in my Consulting Corner segment I'm going to be going over an issue where I had trouble starting Postgres and the different things I had to do to get it working.
[00:00:43] But I hope you, your friends, family and co-workers continue to do well. Our first piece of content is "Lessons from scaling PostgreSQL queues to 100,000 events per second." This is from rudderstack.com, and this was an interesting post about the different optimizations they've made to their system to allow Postgres to seamlessly process 100,000 events per second. The first thing they talk about is their architecture. A single queue, and there are many queues as part of their system, gets broken into data sets that each contain 100,000 jobs to process.
[00:01:23] So in this case there are 250,000 jobs spread across three data sets. And each data set has a representative table, a jobs table, which holds the event payloads along with all the metadata required for processing a job. And then you have a job status table, which is an append-only log that tracks the life cycle of each job.
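To picture the layout, here is a minimal sketch of what one data set's pair of tables might look like. The names and columns are my own guesses for illustration, not RudderStack's actual schema.

    -- hypothetical jobs table for one data set (payload plus processing metadata)
    CREATE TABLE jobs_ds_1 (
        job_id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        payload    jsonb NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- append-only status log: one row per state change, never updated
    CREATE TABLE job_status_ds_1 (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        job_id     bigint NOT NULL REFERENCES jobs_ds_1 (job_id),
        job_state  text   NOT NULL,   -- e.g. 'waiting', 'executing', 'succeeded', 'failed'
        created_at timestamptz NOT NULL DEFAULT now()
    );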
[00:01:47] So the first good observation is that they're not doing a lot of updates.
[00:01:51] They're basically using append-only writes, so they don't have to worry as much about vacuum having to clean up dead rows from updates. And the first recommendation they make is the indispensable role of indexing.
[00:02:03] So initially they didn't have that many indexes, but adding them definitely improved their performance, which of course makes a lot of sense. They also mentioned the benefits of index-only scans, where the entire query can be answered because all of the data resides in the index. But you do have to make sure you're vacuuming frequently enough so that the visibility map stays up to date. Remember, the visibility information is only in the heap and not in the indexes, so that visibility map needs to be current for you to get efficient index-only scans. They also had issues where they were trying to do a SELECT DISTINCT count, but it wasn't efficient, so they used a recursive CTE to replicate what they were trying to do (I've sketched the general pattern after this paragraph). They said if Postgres had skip scans they wouldn't necessarily have had to do this, but the recursive CTE performed better. Next they talked about status updates, retries and table bloat. The problem is that if you have one of those jobs tables with 100,000 jobs in it plus the status table, and a bunch of retries start happening, for example 10 retries for every job, suddenly you have 2 million rows in that job status table very quickly.
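Going back to the recursive CTE trick for a moment, here is the general shape of that pattern, emulating a skip scan to pull distinct values out of an index one at a time. The table and column names are placeholders, not their actual query.

    -- walk the index one distinct value at a time instead of scanning every row
    WITH RECURSIVE distinct_states AS (
        (SELECT job_state FROM job_status ORDER BY job_state LIMIT 1)
        UNION ALL
        SELECT (SELECT js.job_state
                FROM job_status js
                WHERE js.job_state > d.job_state
                ORDER BY js.job_state
                LIMIT 1)
        FROM distinct_states d
        WHERE d.job_state IS NOT NULL
    )
    SELECT count(*) FROM distinct_states WHERE job_state IS NOT NULL;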
[00:03:20] So in general, that kind of status-table bloat can very quickly slow things down. What they implemented for the status table is a compaction method: basically deleting older status records from the table and then vacuuming afterwards to reclaim the dead space.
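As a rough sketch of that kind of compaction, and reusing the hypothetical job_status table from earlier, it might look something like this: delete everything but the latest status row for each job, then vacuum to reclaim the space. This is illustrative, not their exact logic.

    -- keep only the most recent status row per job
    DELETE FROM job_status js
    USING (
        SELECT job_id, max(id) AS max_id
        FROM job_status
        GROUP BY job_id
    ) latest
    WHERE js.job_id = latest.job_id
      AND js.id < latest.max_id;

    -- reclaim the dead space left behind by the deletes
    VACUUM job_status;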
[00:03:37] Now, this blog post never mentions partitioning, so I wonder why not partition each job status table, as opposed to having to do a delete and then a vacuum on it. But this is the solution that worked for them.
[00:03:51] Next they mentioned the power of COPY. Basically, if they have a lot of events that need to be processed, loading those in using COPY is far more efficient than doing it one insert at a time, and even a bit faster than multi-row inserts. Then they talked about another kind of compaction, not of the status tables but of the jobs tables themselves: as jobs get completed, deleting those rows and then doing a vacuum. I still think you're going to hit performance issues that way. But what they started doing is not relying on the delete as much and instead dropping the table entirely once all the jobs in the data set are complete. And then they added additional logic that may merge or compact active data sets as well.
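For a sense of what those two ideas look like in SQL, here is a hedged sketch using the same hypothetical table names as before.

    -- bulk-load a batch of events in one statement instead of row-by-row INSERTs
    COPY jobs_ds_1 (payload, created_at) FROM STDIN WITH (FORMAT csv);

    -- once every job in a data set is complete, drop the whole data set
    -- instead of deleting rows and waiting for vacuum
    DROP TABLE job_status_ds_1;
    DROP TABLE jobs_ds_1;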
[00:04:34] Next they did mention some caching. For example, if a data set has no active jobs, they cache that fact, so it's very easy to tell that there's nothing that needs to be processed and whether a pipeline is active. Apparently these caches have a certain TTL, so they check periodically and keep that value cached. The next issue they encountered was actually related to using Go. What they found is that byte arrays actually transfer twice as much data as a plain text string, so apparently it was more efficient to use text strings with Go. And then they did some optimizations with regard to WAL performance.
[00:05:14] And near the end of the blog post they mention some PostgreSQL settings they changed. They increased max_wal_size, though the value still seems pretty small to me.
[00:05:22] They actually reduced their checkpoint_timeout from five minutes to 30 seconds, which is kind of crazy, because I would imagine frequent checkpoints would cause some performance issues. But they chose to keep that very tight.
[00:05:36] They increased work_mem, which makes sense, and actually increased hash_mem_multiplier a bit to allow for larger hash tables. They also increased maintenance_work_mem, set random_page_cost to 1.1, and tweaked the vacuum settings a little (a rough consolidated example follows this paragraph). But I thought this blog post was a pretty good overview of things you would need to do if you wanted to use Postgres as a queue.
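Here is roughly what those changes look like as ALTER SYSTEM commands. The 30-second checkpoint_timeout and the 1.1 random_page_cost come from the post; the other values are illustrative placeholders, not their exact numbers.

    ALTER SYSTEM SET max_wal_size = '4GB';           -- increased (example value)
    ALTER SYSTEM SET checkpoint_timeout = '30s';     -- down from the 5-minute default
    ALTER SYSTEM SET work_mem = '64MB';              -- increased (example value)
    ALTER SYSTEM SET hash_mem_multiplier = 4;        -- allow larger hash tables (example value)
    ALTER SYSTEM SET maintenance_work_mem = '1GB';   -- increased (example value)
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();                         -- these take effect on reload, no restart needed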
[00:05:58] Now personally, I would really like to use something like OrioleDB for a queuing system, because it uses an undo log as opposed to storing all the different versions of rows in the heap. I imagine that would be much more performant: you wouldn't have to use vacuum, and you should have less write amplification happening. So I really look forward to this option being available, hopefully sometime in the future.
[00:06:23] But if you're interested in this, definitely check out this blog post.
[00:06:27] Next piece of content: "Supabase MCP can leak your entire SQL database." This is from generalanalysis.com, and it's a little bit of an older post, but I just stumbled across it recently. It's a warning for people using MCPs and LLMs and connecting them up to their database. What they say here is the core problem of LLMs interacting with tools: they cannot distinguish instructions from data. Basically, you'll see in here they have a support queue, and someone can submit a support request giving instructions to the LLM that it will follow, which is kind of crazy. So they set up a fresh Supabase project, and this particular attack can exploit an out-of-the-box configuration and the standard service role.
[00:07:17] Now understanding this is pretty important. The customer, or the attacker, would be submitting tickets and is restricted by row-level security in what they can do. You have support agents with a support role that is restricted by row-level security to read and write to the support tables. But then you have the developer using something like the Cursor IDE and a Supabase MCP using the service role with full SQL over every table. So maybe the easiest way to mitigate this is don't allow your developers to connect to the production database, or at least not over MCP. They talk about a basic application they set up, and here's the attack: someone sends a support ticket, and in it it says "important instructions for Cursor/Claude" and gives a set of instructions to basically download an integration tokens table and add it as a new message to this ticket. Nothing is compromised at this point; it's just a request. So you have this request in the support queue, and even a support agent can take a look at it and still everything is fine. But if a developer uses Cursor to review open tickets, they might do a prompt like "show me the latest open support ticket." It loads the project's database schema, lists the support tickets, filters for open ones, and fetches the messages for the latest ticket. And at that point the LLM actually follows the instructions and inserts that integration tokens table as a new message in the discussion, as you can see here. Now in terms of mitigations, they say basically use read-only mode when possible when working with MCP. They also mention adding a prompt injection filter. But I would also say be wary of what LLMs you give access to your production database. Feel free to check this out.
[00:09:09] Next post, also related to that, is "Three Tiger Data engineers told us the truth: MCP security is its Achilles heel." This is from tigerdata.com, and I thought it was interesting given the previous article. So if you found that interesting, you'll definitely appreciate this one as well. Next piece of content: "Root cause analysis: PostgreSQL MultiXact member exhaustion incidents." This is from metronome.com, and again this is a bit of an older blog post, but someone shared it with me last week, so I thought I'd cover it. They had basically a multixact overrun, but it wasn't the IDs that ran out, it was actually the space reserved for multixact members.
[00:09:52] So this is a very comprehensive post; they go into all the detail of it. Basically, they do usage billing as a service. They have a 30 terabyte cluster running on version 13.18, and they were doing an improvement to move some data from a big monolithic table into partitioned tables, and there might have been some other things they were doing as well. Then they talked a little bit about multixact IDs. Multixact IDs are 32-bit identifiers, the same as the transaction IDs you may have heard about. Your multixact IDs can run out, and they do trigger wraparound vacuums, but at the 400 million mark as opposed to the 200 million mark the way normal transaction IDs do. But the limit they were hitting was not the multixact IDs, it was actually the multixact member space, which is a separate data store that maintains the actual transaction IDs participating in each multixact. And what are multixacts?
[00:10:55] It's basically when a single row is locked by more than one session. So imagine you have one row that has multiple SELECT FOR SHARE locks on it. Once a row needs more than one lock, the multixact locking system kicks in: it starts tracking the lock using a multixact ID, and each of the transaction IDs holding the lock gets added to the member space. Now, this happens when you're using, say, SELECT FOR SHARE. It can also happen if you're using subtransactions, and it can also happen with foreign keys. So if you have foreign keys and you're doing a lot of SELECT FOR UPDATEs, you will start increasing the number of multixact IDs you're using, and by extension the member space as well. They mentioned some additional technical details too. The multixact structure is immutable, meaning if you have one row that is locked by two processes and then a third process comes in, it doesn't delete the previous entry, it just adds a new entry with the three transactions that are currently locking it. And if another session then wants to lock that same row, it adds yet another entry with four transactions, but the previous entries are still present. Basically, all these entries can only be removed once a vacuum occurs. And because of that behavior, you basically get quadratic member space growth: the more locks you have on an individual row, the greater the problem becomes.
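If you want to see a multixact get created, you can reproduce it with two psql sessions. This is just a minimal demonstration on a hypothetical accounts table, not anything from the incident.

    -- session 1
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR SHARE;

    -- session 2, while session 1 is still open: the row is now locked by two
    -- transactions, so Postgres allocates a multixact ID and records both
    -- transaction IDs in the member space
    BEGIN;
    SELECT * FROM accounts WHERE id = 1 FOR SHARE;

    -- in either session, xmax on that row now holds the multixact ID
    SELECT xmax, * FROM accounts WHERE id = 1;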
[00:12:16] And they mentioned that foreign keys of course impact multixacts as well. And if you hit the limit of this member space, you'll get an error in the log saying this command would create a multixact with some number of members, but the remaining space is only enough for so many members.
[00:12:33] So again, this blog post goes into a lot of detail, and what's great is that there was another episode of Postgres FM last week on this exact blog post, on multixact member space exhaustion. Nikolay and Michael brought in Andrew and Nate from Metronome to discuss the issue. In terms of the show, they covered a lot of what was in the blog post, but they also mentioned that this problem is relatively rare: multixact ID exhaustion is rare, and exhaustion of the member space is even more rare. Some of the other things they mentioned is that long transactions can really get you in trouble, because all the different transactions need to complete before they can be vacuumed out of the multixact member space. And it can be a real problem, because essentially you have this daisy chaining where another process or session wants to lock that row and just gets added on, and suddenly you have many transactions all locking this one row, all before vacuum can actually clean it up. And they said it is possible to hit this limit with as little as 63,000 overlapping transactions; in theory that would be sufficient to hit this member exhaustion pattern. But I thought this was a great episode. Definitely encourage you to listen to it or watch the YouTube video down here.
[00:13:54] Also related to this, I think the best blog post I've found on multixacts and Postgres is actually from aws.amazon.com, and it's "Multixacts in PostgreSQL: usage, side effects, and monitoring." It gives a great breakdown of multixacts and shows them in action using SELECT FOR SHARE, where you can see it actually working. They mention how it can happen with foreign keys as well, and it can also happen with subtransactions. But what's really great in this post is how to monitor it. They show columns to look at in pg_database to get a sense of your multixact ID age. I've looked in some databases where it's, say, 50,000; I looked in another database and it was essentially zero, so hardly any multixacts were happening. And I will note the one that was zero doesn't really use foreign keys, while the one that was 50,000 was using foreign keys. Now, that's for the whole database; you can identify the table using this query here.
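For the database-level check, the query is along these lines; mxid_age(), datminmxid and relminmxid are real, but this is my own sketch, not necessarily the exact query from the AWS post.

    -- multixact ID age per database
    SELECT datname, mxid_age(datminmxid) AS multixact_age
    FROM pg_database
    ORDER BY multixact_age DESC;

    -- and per table, to find which relations are driving it
    SELECT relname, mxid_age(relminmxid) AS multixact_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY multixact_age DESC
    LIMIT 10;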
[00:14:53] But again, this is for the multixact ID limit, not the member space. To monitor that, you actually need to look at the storage. Now, Metronome was on Aurora, or at least on RDS I think, and you're like, well, how do I get access to the disk? Well, there's actually an extension Aurora offers, called stat utils, and you can use that to get the number of bytes in the pg_multixact/members directory and the pg_multixact/offsets directory. So now you can keep track of how much space is available to you. Definitely another great resource to check out and better understand multixacts.
[00:15:30] Next piece of content: PostgreSQL 18 Beta 2 is released. This is from postgresql.org. As is typical, we are approaching the release date of Postgres 18 this fall, and the Postgres community strongly encourages you to test out the new features in Postgres 18 on your own systems, so feel free to do that.
[00:15:49] Next piece of content: Postgres AI, which Nikolay of Postgres FM founded, actually had a launch week. As of now only three of the announcements are unlocked, but by the time you're watching this episode, all of them should be. The first one is on their DBLab Engine for doing database branching.
[00:16:10] The second one is Postgres AI monitoring, which I think is a tool that basically does monitoring of your cluster. I think it's based on pgwatch 3 and adds Grafana and some other things to it to put together a monitoring solution.
[00:16:26] And then the third one here is the Postgres AI checkup service. This is actually a service where I think they use the monitoring tools, plus a human and AI review, to give you performance recommendations. Feel free to check these out if you're interested. Next piece of content: Supabase also had their launch week last week and probably released 50 different blog posts. Last week's episode had about half of them; this week's episode has a fresh batch. But this particular blog post is the top 10 launches of their launch week, so if you're interested in Supabase, you can definitely check this out. And the last piece of content: "One year of hacking workshops." This is from rhaas.blogspot.com, and Robert basically has a retrospective of how the hacking workshops have gone for Postgres, with links to the previous ones. He also says that signups are now open for August, where they're going to be discussing multidimensional search strategies for composite B-tree indexes. So definitely check that out if you're interested.
[00:17:30] And now it's time for my Consulting Corner. I have a client who has set up a dedicated restore server: on a weekly basis it takes one of the backups that happened that week and restores it in an automated fashion onto the server. Well, they recently did an upgrade, and we actually bumped up some of the configuration options for Postgres. As a consequence, when the restore server attempted to start, it halted because its configuration was insufficient compared to what the primary had. So the restore essentially stopped.
[00:18:04] So I went in, updated the configuration, and tried starting it again. I expected it would just start and then read through the WAL to catch up to the restore point we wanted. But the problem is it just hung. This was on Ubuntu, so I tried sudo pg_ctlcluster 17 main start, and that just hung, and if you checked the Postgres logs, nothing was happening. Normally you see it writing something that it's trying to start, maybe there's an error message, but there was nothing. So I tried starting it interactively, just using the postgres binary. That worked, and then I was even able to get it to work using the pg_ctlcluster command with the foreground option. When I used that, Postgres immediately started, replayed the WAL, and was accessible, but without the foreground option it wouldn't start. So before I got too far, I thought, okay, maybe there's something hanging around; let me just try restarting the system.
[00:19:08] So I restarted the system, and once it came up, Postgres started normally and everything worked. As with most computers, sometimes stopping and starting it definitely helps. But if you run Postgres on Ubuntu, I definitely encourage you to remember that foreground option if you ever run into an issue like this.
[00:19:27] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode.
[00:19:38] There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.