Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about autovacuum configuration, duplicate key violations, pipelining, and tuning max_wal_size. I'm Creston Jameson, and this is Scaling Postgres episode 217.
[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Is my autovacuum configured properly? This is from Depesz.com, and he's talking about a scenario where you're wondering if you have autovacuum configured correctly. There are a couple of different ways you can check this, but one way he suggests is to look at how active your autovacuum workers are. So he basically says to query pg_stat_activity and count how many processes are currently running autovacuum, and then compare that to what autovacuum_max_workers is set at. He says to log this periodically, say every five minutes. Basically, if you're constantly using all of the workers, it probably means you need to do more configuration of your autovacuum. You can also do things like look at how often tables are being vacuumed using the Postgres system views, but this is a pretty good way to gauge whether you might need some additional configuration. As he shows here, there are a lot of different settings you can adjust for autovacuum, but the key ones that determine how fast vacuum runs are the four he lists here. Now, two of these are related to vacuum and two are related to autovacuum, but the autovacuum settings sometimes inherit from the vacuum settings. Basically, these are the defaults: there's a cost delay and a cost limit. Each task that vacuum has to do is assigned a particular cost, and as those tasks are done, the costs get added up. Once the total hits the limit, vacuum or autovacuum pauses for the delay mentioned here; you can see that vacuum has no delay, while autovacuum has a 2 millisecond delay. So the way to get vacuum to work more quickly is to increase the cost limit before it pauses, or to reduce the pause that happens when the cost limit is hit. By adjusting these, you can make vacuum or autovacuum faster, or even slower if you notice it's causing a problem. So this post goes through the way that you could do that. He also mentions that there are table-level configurations you can do, so you can change the cost limit and the cost delay per table. If you have a particularly large table, you can assign them at the table level and then have a generic setting for all the other tables. But this is a pretty simple post that shows you exactly what you would need to tweak to make autovacuum run faster or slower just by changing a few configuration variables.
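To give a rough idea of the kind of periodic check he's describing, a query along these lines counts the running autovacuum workers against the configured maximum (the exact text filter may need adjusting for your Postgres version):

-- How many autovacuum workers are running right now, versus the maximum allowed?
SELECT count(*) AS active_autovacuum_workers,
       current_setting('autovacuum_max_workers') AS max_workers
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';

And a per-table override of the sort he mentions might look something like this, with big_events standing in for a hypothetical large table and purely illustrative values:

-- Per-table cost settings override the global autovacuum configuration.
ALTER TABLE big_events SET (
  autovacuum_vacuum_cost_limit = 2000,
  autovacuum_vacuum_cost_delay = 2
);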
[00:02:49] The next piece of content: Hidden dangers of duplicate key violations in PostgreSQL and how to avoid them. This is from aws.amazon.com, and they're talking about when you do an insert and you get an error in your log that says duplicate key value violates unique constraint on a particular index, which happens to be the primary key here. When this happens a lot, it can cause some issues. Now, I've just considered it a nuisance in terms of the log, but whenever this happens, a full transaction is completed even though you see the error. So you're going to use up transaction IDs, because a transaction ID is indeed consumed every time you get this duplicate key violation. When you're doing the insert, the row actually does get inserted, but it's not considered an active row; it's a dead tuple. So you're also going to get a dead tuple generated when this happens, which means autovacuum now has to do more cleanup. And finally, because you've inserted a dead tuple, more storage gets used up. So a key violation that causes an insert to fail results in transaction IDs being consumed and rows actually being inserted, with all of those knock-on effects. Now, the way to avoid it that this post mentions is that you can do ON CONFLICT DO NOTHING. What that means is that there may be a conflict, but it's not going to return an error; it will just do no insert, as they show here. The benefit is that no row is entered and no transaction is consumed, so none of this happens, because it's handled as a speculative insert, I believe they're saying here. This is a long blog post that goes into a lot more depth about why this happens and gives specific examples of the effects. But in a nutshell, if you have a lot of these duplicate key violations in your log, you may want to consider doing an INSERT with ON CONFLICT DO NOTHING. That will avoid those messages in the logs and avoid creating the transaction and the row, so it could potentially get you more throughput. But if you're interested in learning more, definitely check out this blog post.
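As a minimal sketch of that ON CONFLICT approach, using a purely hypothetical table:

-- Hypothetical table with a primary key on id.
CREATE TABLE IF NOT EXISTS accounts (id bigint PRIMARY KEY, name text);

INSERT INTO accounts (id, name) VALUES (1, 'alice');

-- A plain re-insert of id 1 would raise a duplicate key violation and, as the
-- post explains, still consume a transaction ID and leave a dead tuple behind.
-- With ON CONFLICT DO NOTHING, the conflicting row is simply skipped.
INSERT INTO accounts (id, name) VALUES (1, 'alice')
ON CONFLICT (id) DO NOTHING;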
[00:04:59] The next piece of content: Pipeline mode for better PostgreSQL performance on slow networks. This is from Cybertec-postgresql.com, and they're talking about the new mode that's been added in PostgreSQL 14, where the libpq C API supports pipelining. That means the client doesn't have to wait for a response immediately; you can send multiple statements to the database without waiting for the first one to return. They go through the different messages that have been added to the protocol to support this, and this post explores the performance advantages of pipelining. They have a simple example here where they have an accounts table, and within a transaction they're going to increment one account and decrement the other. Now, without pipelining, you need to send the BEGIN, wait for the server to respond, send the first update, wait for it to return, send the other update, wait for it to return, and then send the COMMIT. Whereas with pipelining, they mention that a pipeline is already part of a transaction, so that extra communication isn't needed: once you're in pipeline mode, you simply send one update and then the next, and the results come back whenever they're ready. So this can significantly reduce the impact of latency between the client and the server. And they measure it here: they developed a C program that uses this pipeline mode to come up with these performance results. They added a 50 millisecond delay between the client and the server for the tests, and what you notice is that with pipelining you essentially pay two times the network latency to send the updates and get the results back, whereas it's eight times the network latency without pipelining. So there are definitely use cases where pipelining can result in a real performance improvement. Now, they also said that another way to avoid network round trips is to put your code into a function, because then the function just gets called once with a single call statement and the whole transaction, including the updates, happens inside the function. So that's a way to achieve similar results, although, as he mentions, a lot of programming shops don't really like to put much logic in database functions. Now, this is an interesting post, but personally I'm waiting for the programming language libraries to catch up and implement pipelining so that we could use it easily. For example, I do a lot of programming work in Ruby, and I would love for the pg Ruby library to support this type of pipelining. But if you want to learn more, you can definitely check out this blog post.
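As a rough sketch of that single-round-trip alternative, assuming a hypothetical accounts table with a balance column, it could be a procedure like this, so the client only sends one CALL:

-- Both updates happen in one server-side call instead of several round trips.
CREATE PROCEDURE transfer(from_id bigint, to_id bigint, amount numeric)
LANGUAGE plpgsql AS $$
BEGIN
  UPDATE accounts SET balance = balance - amount WHERE id = from_id;
  UPDATE accounts SET balance = balance + amount WHERE id = to_id;
END;
$$;

CALL transfer(1, 2, 100);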
[00:07:33] The next piece of content: Tuning max_wal_size in PostgreSQL. This is from Enterprisedb.com, and he's talking about the advantages of tuning max_wal_size starting from a stock configuration of Postgres. They say the greatest advantage comes from adjusting the shared_buffers configuration, where they saw a 3.5 times performance increase over the stock configuration, whereas tuning max_wal_size resulted in a 1.5 times increase in performance. But they also had one test where they saw a 10x performance improvement, so that's pretty large. They go through why this setting is important in terms of managing WAL, and the main reason is to minimize checkpoints, and they show different ways to go through the tuning and monitoring process for it. But I do it a little bit differently, because they don't mention a setting here that is highly important to me when configuring max_wal_size. What I do is basically determine how much WAL generation you have and, based on that, pick a checkpoint_timeout. Do you want it to be ten minutes? Thirty minutes? A longer checkpoint_timeout can ease WAL writing, but it means that if the server crashes and needs to restart, it's going to take a long time to replay all that WAL, whereas with a shorter checkpoint_timeout restarts are faster, but you're going to have more WAL writing pressure, depending on the activity of the database. Once that checkpoint_timeout is selected, I choose a checkpoint_warning close to that checkpoint_timeout, because what checkpoint_warning does is tell you when the amount of WAL you're generating has exceeded your max_wal_size and Postgres is doing a checkpoint earlier than the timeout you've selected. So basically, if you see these warnings in the logs, it means you need to increase max_wal_size until you stop seeing early checkpoints and they're pretty much happening at your checkpoint_timeout interval. That's personally how I like to set max_wal_size based on the activity, but if you want to learn more about it, you can check out this blog post.
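To make that concrete, the settings involved would be adjusted something like this; the values here are purely illustrative, not recommendations:

-- Pick checkpoint_timeout based on how much WAL replay you can tolerate after
-- a crash, set checkpoint_warning close to it, and raise max_wal_size until
-- the early-checkpoint warnings stop appearing in the logs.
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_warning = '25min';
ALTER SYSTEM SET max_wal_size = '16GB';
SELECT pg_reload_conf();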
[00:09:42] The next piece of content: PG Phriday: Wrapping transactions right round. This is from Enterprisedb.com as well, and it was a little bit hard to read some of this post because the code looks to be all on one line rather than spread over multiple lines. But basically they're talking about transaction IDs in Postgres: it's a 32-bit ID that wraps around, so you have about 2 billion transactions you can use before wraparound. They describe some of the implementation, as well as the importance of freezing tuples so that you avoid transaction ID wraparound. They also bring up some issues that can actually cause it. The first one they mention is prepared transactions. This is where you're doing two-phase commits and an orphaned prepared transaction hangs around in the system and prevents old transaction IDs from being frozen and reused. The next one they talk about is long-running transactions: if you have a transaction that's open for a really long time, older transaction IDs can't be recycled and reused, which will cause problems. And the third one they mention is hot standby feedback, which again can hold back transaction ID reuse. The other case where I've seen this a lot more is orphaned replication slots. They don't really mention that here, but orphaned replication slots can definitely prevent transaction IDs from being recycled, so that's something else to watch out for. But if you want to learn more, you can definitely check out this blog post.
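If you want to check for these culprits yourself, a few queries against the standard system views give a rough picture:

-- Prepared transactions that were never committed or rolled back.
SELECT gid, prepared, owner FROM pg_prepared_xacts ORDER BY prepared;

-- Transactions that have been open for a long time (the one-hour cutoff is arbitrary).
SELECT pid, xact_start, state, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 hour';

-- Replication slots, including inactive ones that may be orphaned.
SELECT slot_name, active, xmin FROM pg_replication_slots;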
[00:11:09] The next piece of content: Using pgTAP to automate database testing. This is from Endpointdev.com. They're talking about a tool called pgTAP, where TAP stands for Test Anything Protocol. Basically, it's an extension you can install in Postgres, and it gives you functions that let you create tests to validate that your database is in a particular state. Now, their particular use case is that they were setting up what looks like row-level security policies, and they wanted to ensure they were accurate. So they created tests within the database, using this pgTAP extension, to validate that their configuration was correct. It's definitely an interesting tool, but I imagine you could use any testing framework in any language to do a similar thing, as long as it's communicating with the database. For example, I use Ruby, so I could imagine using a Ruby testing library that just talks to the database, queries it exactly like they're doing here, and gives you the same results. But if you want to learn about a database extension that does it, definitely check out this blog post.
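As a minimal sketch of what a pgTAP test can look like, with a hypothetical accounts table; pgTAP itself has to be installed with CREATE EXTENSION pgtap first:

BEGIN;
SELECT plan(2);

-- Two simple assertions: the table exists, and row-level security is enabled on it.
SELECT has_table('accounts');
SELECT ok(
  (SELECT relrowsecurity FROM pg_class WHERE relname = 'accounts'),
  'row level security is enabled on accounts'
);

SELECT * FROM finish();
ROLLBACK;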
[00:12:17] The next piece of content: Hooks: the secret feature powering the Postgres ecosystem. This is from Ctodive.com, and they're talking about the capabilities in Postgres that allow extensions to hook into its core functionality and alter it. They show some of the different hook areas that exist: security hooks that tie into functions related to passwords and user creation, function hooks that run during function execution, planner hooks, executor hooks for processing statements, as well as PL/pgSQL hooks for dealing a little bit more with functions. In this blog post they show how some well-known projects such as Timescale, pg_stat_statements, and Supabase use these hooks to make their products do what they do. So if you want to learn more about this, definitely check out this blog post.
[00:13:12] The next piece of content: How to benchmark performance of Citus and Postgres with HammerDB on Azure. This is from Citusdata.com, and they're talking about benchmarking the performance of Citus using the HammerDB tool, measuring OLTP workloads, OLAP workloads, as well as HTAP workloads. But I would say it's also a pretty good post on how to use HammerDB, which is an open source tool for doing these types of benchmarks. So if you want to learn more, you can check out this blog post. This is also the topic that pganalyze covered in their Five Minutes of Postgres, and in Five Minutes of Postgres episode nine they cover PostGIS versus GPUs, tuning parallel queries, and automating Citus extension benchmarks with HammerDB. It goes into more depth on the HammerDB post we just covered, so I definitely encourage you to check out this piece of content as well.
[00:14:09] The next piece of content: Distributed database with PostgreSQL - atomic commit problems. This is from Highgo.ca.
[00:14:17] Now, they're saying this is a post about problems, and there's not a lot of solutions. So this is basically talking about the problem of creating a distributed database. They're talking about the scenario where Postgres uses foreign data wrappers to communicate with multiple foreign PostgreSQL servers that have been partitioned in some way. So maybe you have a set of servers and, say, your accounts are partitioned across those servers so that you can answer questions in parallel, or even dedicate a server to all account-related queries for a particular customer. Basically, it's a way to scale out Postgres as opposed to just scaling it up. But the problem comes when you have a query that needs to span multiple foreign servers. Maybe you want to increment one account and decrement another account that reside on different servers, and you want those to commit all at once or not at all. One scenario they are looking into is the two-phase commit feature that already exists, where you do a PREPARE TRANSACTION and then a COMMIT PREPARED; there's a small sketch of those commands at the end of this segment. Things work fine when you're preparing a transaction, because you can send that prepared transaction to multiple hosts, and if there's an issue, or you don't hear back, it can be rolled back easily. But what if all three of those servers respond that they're ready for the transaction, and then in the second step, the actual commit, the first two commit but the third one hangs? Now you have an orphaned, uncommitted transaction out there on instance DN3, but it's already committed on DN1 and DN2. How do you reconcile that? There's no great way of resolving that at this point. So basically this post talks about some of those issues and unfortunately doesn't have a lot of solutions for them right now. It's definitely something they're going to keep discussing, but if you want to learn more, you can check out this blog post.

The next piece of content: Run PostgreSQL on Kubernetes with Percona Operator and Pulumi. This is from Percona.com, and they're talking about running Postgres on Kubernetes using their operator along with Pulumi, a service that lets you use your own programming language to deploy to cloud infrastructure. So if you want to learn more about doing that, you can definitely check out this blog post.

The next piece of content: Spatial filters in pg_featureserv with CQL. This is from Crunchydata.com, and they're talking about pg_featureserv, which is a web service that allows you to query into PostGIS and Postgres to do spatial analysis. This post is particularly about the spatial filter features that have been added. So if you want to learn more, you can check out this blog post.

The next piece of content: the PostgreSQL Person of the Week is Jaime Casanova. If you're interested in learning more about Jaime and his contributions to Postgres, definitely check out this blog post.

And the last piece of content: we had another episode of the Rubber Duck Dev Show this past Wednesday. This last episode was on real-world performance enhancements that I did with my application. So if you want to learn about all the different changes I discovered that can improve performance, including a database one related to a lateral join, I encourage you to check out our show.
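Going back to that atomic commit discussion for a moment, here is the small sketch of the built-in two-phase commit commands the post builds on. The transaction identifier, table, and values are just illustrative, and this requires max_prepared_transactions to be set above zero:

-- Step 1: do the work, then prepare it; the transaction becomes durable
-- but is not yet committed or visible to other sessions.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'xfer_42';

-- Step 2: a coordinator later resolves it one way or the other.
COMMIT PREPARED 'xfer_42';
-- or: ROLLBACK PREPARED 'xfer_42';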
[00:17:30] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.