Episode Transcript
[00:00:00] One issue I've seen with some of my clients is having problems with retries. In other words, a lot of times applications are designed so that if some process fails, okay, we'll try it again. But sometimes that starts a retry storm where it keeps trying again and again and again, which, if you don't have certain protections in place, basically increases the number of connections being used, increases the CPU utilization on the database, and causes a lot of problems. And this was also something that OpenAI mentioned in their post we covered last week. They listed several issues that ultimately wound up causing too much database load. But we're going to talk about one configuration option I haven't heard a lot about that could be used as a tool to help mitigate overloading your database. So we'll talk about that this week, but I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Postgres client connection check interval. This is from ardentperf.com, and he said he found this post on LinkedIn. The setting has been in Postgres since version 14 and it's called client_connection_check_interval. What this does is have the Postgres server periodically check with the client to make sure it's still alive while a query is running, and cancel the query if the client is dead or has disconnected. So a particular use case, he says here, is maybe a client starts a long query, then the client dies, either from a crash or a network outage. Postgres keeps on running that query, and only after finishing, which may take hours, does it notice the connection is dead. But I think the retry case is another common scenario, and he says he's still surprised this isn't enabled by default.
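As a rough sketch of how you might enable it (the 10-second value here is just illustrative, not a recommendation from the post):

    -- client_connection_check_interval is available since Postgres 14.
    -- It controls how often the server polls the client socket while a query is
    -- running; 0 (the default) disables the check entirely.
    ALTER SYSTEM SET client_connection_check_interval = '10s';
    SELECT pg_reload_conf();   -- picked up without a restart; it can also be SET per session
    SHOW client_connection_check_interval;

Note that detecting a dead client this way depends on kernel support for polling closed sockets, so behavior can vary a bit by operating system.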
[00:01:49] So this blog post is actually talking about an issue he blogged about last week. I think that was mentioned in Scaling Postgres, but not in the show itself, it was in the links. So you can look at the link here or look at last week's show, but the article from last week is called How Blocking Lock Brownouts Can Escalate from Row Level to Complete System Outages. And specifically he was talking about issues running through PgBouncer, and something that did assist was turning on peering for PgBouncer to make sure the cancel request goes to the right server. But even with that, I think he was still seeing some issues.
[00:02:30] But this setting, where the server itself can just go ahead and cancel a query because the client's no longer there, helps in even more scenarios, and particularly with the unexpected behaviors that he saw. He ran his test suite again, and you can see the TPS, which is the green line, drops like a rock.
[00:02:51] The total connections hit the max of 100 in his testing scenario, and the PgBouncer client waiting count hits its max as well until the situation is resolved. So he set the client connection check to 15 seconds, and you can see the brown connection line does start dropping after about 15 seconds, and you'll see a spike in the green line as the transactions per second go up, and the PgBouncer client waiting drops as well, but it still hits the connection limit and the PgBouncer limit too. So he said, all right, let's drop it to two seconds, and there things looked much, much better. The total connections hit a max of 30, so it never approached saturating the 100 connections, and the PgBouncer client waiting went nowhere near its max. I should also note you now have some transactions going through. It's nowhere near the transactions per second before the incident, but the whole system didn't go down. And then he tried as low as 500 milliseconds for this client connection check, and again the client connections stayed very low and transactions still went through during this particular issue.
[00:04:08] And then he showed, for the different client connection check settings, the number of PgBouncer connections that were in a CLOSE_WAIT state. And basically at the smallest client connection check of 500 milliseconds, you can tell the number of PgBouncer connections in that CLOSE_WAIT state is essentially zero. So if you've ever had to deal with retry storms, or you have Postgres sessions that are still running even though clients have stopped, maybe you want to check out this parameter and start using it. Next piece of content: Importance of Tuning Checkpoint in PostgreSQL. This is from percona.com, and the first thing he says is, well, what is a checkpoint? It's basically where Postgres flushes the state of the database memory, primarily shared buffers, onto the disk, and it basically sets a consistent state for the database for recovery purposes. And it does that by identifying the dirty buffers, meaning what needs to be written to disk; writing them to the operating system over a period of time so it doesn't impact the performance of the server too much, because too much checkpointing activity can cause a performance impact; fsyncing the files to ensure they're actually written to the storage; updating the Postgres control file with the redo location, so we know we only need WAL files after that point; and then recycling the WAL segments. This could be deleting them, or if you have archiving turned on, sending those segments to the archive log directory or some sort of archiving service. Now, when checkpoints are running, typically on a TPS graph you can see this, what he calls a sawtooth pattern, where performance drops after the checkpoint and then it builds up again and drops and builds up. And he says most of the performance impact is from having to record full page image writes to the WAL file, because Postgres has to log the entire content of a page the first time it's modified after a checkpoint.
[00:05:58] So there's a lot of full page image writes that happen right after a checkpoint. Now he says, all right, how much I/O can we save if we tune? So he's using pgbench with a little over a million transactions, and at the default checkpoint timeout of five minutes you get 12 gigabytes of WAL generated. But if you extend the checkpoint timeout out to as long as one hour, you only generate two gigabytes of WAL. So longer checkpoint timeouts prevent you from having to write so many full page images to the WAL files, and he shows that here: in terms of full page images written, you have a ninefold savings by doing an hourly checkpoint. So in terms of counterarguments and fears about extending the checkpoint timeout, he said the biggest one is people worrying about how long recovery will take. So if you have a crash event right before an hourly checkpoint, well, you have to replay that hour's worth of WAL before the database system comes up and is available again. But he says the reality is that most critical systems will have a standby available. It could be synchronous or asynchronous, or even a high availability solution like Patroni. So you can just fail over immediately to a replica and it can start taking over the traffic. You don't need to wait for the old primary to finish its crash recovery process.
[00:07:26] Therefore, he says, quote, the time it takes for crash recovery becomes irrelevant. And he also says if you're doing fewer checkpoints, you're generating less WAL, as we just talked about, so you actually have less WAL to replay when you're doing that recovery process.
[00:07:42] And he says the other thing to consider is that it's not like it's going to take you an hour to get to a consistent state.
[00:07:49] Typically it takes much, much less time than that. And he has some examples of recovery here that took 30 seconds or 60 seconds. It depends on the size of your database, of course. Anyway, he says, how can you tune checkpointing? Well, basically by setting that checkpoint_timeout. Next is sizing max_wal_size, hopefully high enough that you're not seeing checkpoint warning messages being hit, so that would be another configuration option I would adjust. And lastly, the checkpoint_completion_target, which is the fraction of the checkpoint timeout that it should take to complete the checkpoint. So giving it more time to complete that checkpoint causes less burden on the system, and typically for production you're going to set it around 90%. And he does talk about logging, and you can of course log your checkpoints. But the other important thing that I look at, which isn't really mentioned here, is that checkpoint warning. That warning basically says we had to do a checkpoint before your timeout because the max_wal_size configuration is insufficient. So if you're seeing those warning messages, that helps you know you need to extend your max_wal_size (there's a rough sketch of these settings after this segment). But check this out if you want to learn more. Next piece of content: The Skip Scan You Already Had Before Version 18. This is from Robbins.in, and he says, you know, we got skip scan in version 18, but the reality is we have had a version of a way to do skip scanning. I mean, it wasn't really a skip scan, but you could still use an index even though you're not querying the leading column. What it does is basically a sequential scan of the whole index, which is usually more efficient than doing a sequential scan of the whole table.
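Picking up those checkpoint settings, here is a minimal sketch of what adjusting them might look like; the values are only illustrative, not recommendations from the Percona article:

    -- Checkpoint-related settings mentioned above (example values only).
    ALTER SYSTEM SET checkpoint_timeout = '30min';        -- default is 5min
    ALTER SYSTEM SET max_wal_size = '16GB';               -- sized so WAL volume doesn't force early checkpoints
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread the checkpoint I/O over ~90% of the interval
    ALTER SYSTEM SET log_checkpoints = on;                -- log each checkpoint so you can review timing and cause
    ALTER SYSTEM SET checkpoint_warning = '30s';          -- warn when WAL-triggered checkpoints happen closer together than this
    SELECT pg_reload_conf();                              -- all of these can be changed with a reload

After a change like this, watching the logs for "checkpoints are occurring too frequently" messages is a reasonable way to tell whether max_wal_size is still too small.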
[00:09:30] But he says the skip scan that exists in version 18 is great for you where the leading column is low cardinality. So imagine a status field that has four or five values. He shows an example here where you have 10 distinct values in the leading column, and querying it in Postgres 17 versus 18, you're basically able to achieve seven times more transactions per second with the skip scan in version 18.
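As a small sketch of the kind of setup where that kicks in (the table, columns, and values here are my own illustration, not from the article):

    -- Multicolumn index whose leading column has only a handful of distinct values.
    CREATE TABLE orders (
        id         bigint GENERATED ALWAYS AS IDENTITY,
        status     text NOT NULL,        -- low cardinality, e.g. 'new', 'paid', 'shipped'
        created_at timestamptz NOT NULL
    );
    CREATE INDEX orders_status_created_idx ON orders (status, created_at);

    -- This query doesn't filter on the leading column (status).
    -- Postgres 17 and earlier can still use the index, but only by reading the whole thing;
    -- Postgres 18 can skip between the few distinct status values instead.
    EXPLAIN ANALYZE
    SELECT count(*) FROM orders WHERE created_at >= now() - interval '1 day';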
[00:09:57] Where things fall apart is where that leading column has a high cardinality. So that feature is not going to kick in in version 18, and you're going to get similar performance to version 17, because it's just doing a full scan of the index, which is slow but usually faster than scanning the whole table. But if you want to learn more, you can check out this blog post. Next piece of content: It's 2026. Just Use Postgres. This is from tigerdata.com, and we've seen posts like this before. But what I thought was interesting is that he says, you know, a lot of recommendations say to use the right tool for the job. Well, if you were to do that and you wanted search, you'd probably choose Elasticsearch. If you wanted vectors, you might choose Pinecone. If you wanted caching, you might choose Redis. If you wanted to store documents, you might use MongoDB. If you wanted a queuing system, you might use Kafka. If you wanted time series, maybe a dedicated database like InfluxDB, and then Postgres for everything else. So to me this would seem like a bit of a nightmare to manage all of this, and as you know, of course, Postgres can handle all of it. But what's interesting about this blog post is it shows the exact extensions you would need to use to handle each of these responsibilities. So if you wanted full text search, well, you could use Postgres' built-in full text search, or you could use their extension pg_textsearch, which uses BM25 indexes, or there are other vendors that provide that. If you wanted vector search, of course there's pgvector, and there are other extensions like pgvectorscale that Tiger Data developed. In terms of time series, well, they are literally the company behind the Timescale database. If you wanted caching, you could use unlogged tables. If you wanted documents, of course there's JSONB in Postgres. And if you needed geospatial, you can use PostGIS. And they have the code down here to add those different extensions and actually how to use them for different use cases (there's a short sketch of that after this segment). So check this out if you want to learn more. Next piece of content, again going back to the just-use-Postgres argument: we have 10 Elasticsearch Production Issues and How Postgres Avoids Them. This is from tigerdata.com, although I think only about half of these are actually problems specific to Elasticsearch.
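For reference, a minimal sketch of what enabling a few of those pieces can look like; the extension names I'm sure about are pgvector, PostGIS, and TimescaleDB, and the table definitions are just illustrations, not the article's code:

    -- Extensions (each must already be installed on the server).
    CREATE EXTENSION IF NOT EXISTS vector;       -- pgvector, for vector search
    CREATE EXTENSION IF NOT EXISTS postgis;      -- geospatial
    CREATE EXTENSION IF NOT EXISTS timescaledb;  -- time series; also needs to be in shared_preload_libraries

    -- Caching without WAL overhead: an unlogged table.
    CREATE UNLOGGED TABLE cache (
        key        text PRIMARY KEY,
        value      jsonb,
        expires_at timestamptz
    );

    -- Document storage: JSONB with a GIN index for containment queries.
    CREATE TABLE documents (
        id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        doc jsonb NOT NULL
    );
    CREATE INDEX documents_doc_idx ON documents USING gin (doc);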
[00:12:09] Like the first one is JVM garbage collection pauses. Well, that could only happen in Elasticsearch, because Postgres is not written in Java. You could get a mapping explosion in Elasticsearch, but Postgres schemas are not susceptible to that. Then there's over-sharding, where you have to shard ahead of time with Elasticsearch, whereas with Postgres, hopefully you can scale more easily. Next is the deep pagination performance cliff. Basically, if you have a paging interface and you're looking at the 500th page, it takes forever to load.
[00:12:41] Well, you're going to get the same thing in Postgres if you're using LIMIT and OFFSET. So the way to avoid that is using the little thing mentioned here: keyset pagination (there's a sketch of it after this segment). Next is being susceptible to split brain and data loss. Postgres can have this too, but because of the eventual consistency and the distributed architecture of Elasticsearch, I'd say this is a little bit easier to deal with in Postgres, though you have other trade-offs. And again, eventual consistency surprises: that's definitely something in Elasticsearch, but some of the rest of these apply to both systems. You could get security misconfigurations in both of them. You could encounter monitoring complexity in both, along with pipeline sync issues and infrastructure costs. So personally, I think about half of these are good reasons, but they basically talk about their BM25 index that is available via their pg_textsearch extension, and check this out if you want to learn more. Next piece of content: Companies Behind Postgres 18 Development. This is from the consensus.dev, and he broke down the commits not by person but by company. So you can see how many inserts, deletions, commits and contributors there are across the different companies, and we'll just mention the top three here.
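Going back to the deep pagination point, here's a minimal keyset pagination sketch; the table and column names are illustrative, not from the article:

    -- OFFSET pagination has to walk and discard all the earlier rows:
    SELECT id, title, created_at
    FROM posts
    ORDER BY created_at DESC, id DESC
    LIMIT 20 OFFSET 10000;   -- page 501; gets slower the deeper you go

    -- Keyset pagination remembers where the last page ended and seeks straight there.
    -- An index on (created_at, id) lets Postgres satisfy this with an index scan
    -- instead of wading through all the earlier rows.
    SELECT id, title, created_at
    FROM posts
    WHERE (created_at, id) < (:last_created_at, :last_id)  -- values from the previous page's last row
    ORDER BY created_at DESC, id DESC
    LIMIT 20;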
[00:13:54] Number one is EnterpriseDB, of course; second is Microsoft; and third is Amazon. But again, thank you to everyone and every organization that helps make Postgres what it is. Next piece of content: Over 15 Years Later, Some Solutions Are Still Great Solutions. This is from RichYen.com, and this is a blog post following on from the one that OpenAI did about scaling their database; he noticed a lot of similarities with scaling his own database a number of years ago. So he said some of the techniques were not so different from 15 years ago.
[00:14:32] And again, that's something I mentioned when I started looking at that blog post: some of this is probably going to be old hat for you, because these are typical steps you need to take when you scale Postgres. But he highlights some of the things that he and OpenAI did in terms of reducing load on the primary, implementing connection pooling and isolating workloads, as well as scaling read replicas. So if you want to learn more about that, check this out. Next piece of content: Why OpenAI Should Use Postgres Distributed. This is from enterprisedb.com, and this again kind of mirrors what I mentioned about how, you know, they probably should have considered sharding a little bit earlier. Well, they're advocating that, but they also have a product to sell, which is their active-active Postgres, so of course they're endorsing that. And the last piece of content is Hackorum, a forum-style view of pgsql-hackers. This is from the Percona community. Most of Postgres development happens through the pgsql-hackers mailing list, and they wanted to enhance the experience of consuming that mailing list, so they decided to build a forum solution for it. Right now it is read-only, they say for now, and it's more like a forum presentation of the information.
[00:15:46] And they do say it is a work in progress or a proof of concept. So here are all the features it has now and what they plan to do next, and here is an example of what it looks like.
[00:15:58] So I guess if you want to keep a pulse on what's going on in Postgres development, you can definitely check out the site. I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content I discussed, as well as sign up to receive weekly notifications of each episode there. You can also find an audio version of the show, as well as a full transcript. Thanks. I'll see you next week.