Slow Queries, 4TB Upgrade, Postgres on ZFS, Storage Stampede | Scaling Postgres 212

Episode 212 April 24, 2022 00:21:23

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss dealing with slow queries, upgrading a 4TB database, running Postgres on ZFS and handling a storage stampede.

To get the show notes as well as get notified of new episodes, visit:

https://www.scalingpostgres.com/episodes/212-slow-queries-4tb-upgrade-postgres-on-zfs-storage-stampede/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about slow queries, a 4TB upgrade, Postgres on ZFS, and a storage stampede. I'm Creston Jamison, and this is Scaling Postgres episode 212.

[00:00:19] Alright, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Dealing with slow queries with PostgreSQL". This is from PG IO. They're talking about a number of ways you can deal with slow queries, and the particular scenario they have in mind is a deployment, because they say right at the top, quote, "in every deployment there's always a few queries that run too slow". So basically, you make a change and you've created some slow queries. How do you track this?

The main way to track it is pg_stat_statements, which lets you see which statements are taking the most time in your database. But that has a few disadvantages. One is that it collects cumulative statistics, so if you just did a deployment, there's no way to see the immediate change unless you're copying the contents of pg_stat_statements to some location on a periodic basis, say every hour. Then you can see a change that happened as a result of a deployment, for example. So you may need another tool or technique to track changes to these cumulative statistics. The other thing it doesn't include is query parameters; those aren't included in the queries in pg_stat_statements. That can have an impact because maybe you're querying a column with one value for which the index is highly efficient, so it uses the index and returns quickly, whereas for another value a sequential scan is more efficient and that's what you get. So it can be hard to tell whether a query is genuinely slow; it may depend on the values you're selecting. That's another disadvantage.

Another place you can look is slow query logging in the logs themselves. You can set log_min_duration_statement so that if a statement exceeds a particular value, say 5 seconds, it gets logged to the Postgres log, and you can examine that to see which queries are slow. There are a number of other parameters that control how much gets recorded, and you can do sampling if it would otherwise log too many statements. People also tend to use the auto_explain extension, so that whenever a slow query is logged, its explain plan is logged as well. They talk about setting that up and the configuration variables needed to get it working.

Then there's the issue of still-executing queries: a query only ends up in pg_stat_statements once it has completed, so anything still running won't be there. You need to look at pg_stat_activity to see what's currently running, and you want to be really wary of sessions that are idle in transaction, particularly for a long time, because that can cause a whole host of problems in Postgres, with vacuum not being able to complete and eventually a transaction ID wraparound issue. The next issue they talk about is locks: you can log lock waits, so you could turn that on, and you can also adjust the deadlock timeout if you need to.
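As a rough sketch of the kinds of checks discussed in that post (the views and settings are standard Postgres; the thresholds and the per-session LOAD are just example choices, not the article's exact commands):

-- Top queries by total time; requires the pg_stat_statements extension.
-- Column names are as of Postgres 13; older versions use total_time.
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Log any statement slower than 5 seconds to the Postgres log.
ALTER SYSTEM SET log_min_duration_statement = '5s';
SELECT pg_reload_conf();

-- auto_explain can also capture the plan of slow statements; loading it per
-- session is only for illustration (normally you would add it to
-- shared_preload_libraries or session_preload_libraries instead).
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '5s';

-- What is running right now, and any long idle-in-transaction sessions.
SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start NULLS LAST;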
They also talked about using the pg_locks system view if you want real-time information on what locks are being taken in the system. So those are the different places you can go to figure out how to deal with slow queries in Postgres.

The next piece of content is "How Retool upgraded our 4 TB main application PostgreSQL database". This is from retool.com, and they're talking about an upgrade they needed to do because they were on version 9.6, which was slated for end of life on November 11, 2021. They chose version 13 at the time because they wanted the most time before they would have to upgrade again, and also because PostgreSQL 13 had a number of enhancements they were looking forward to, such as parallel query execution, parallel vacuuming of indexes, and adding defaults to columns safely.

In terms of the upgrade strategy, the easiest, of course, is a pg_dump and then a pg_restore, but at 4 TB, as they mention, that would take days, and they wanted about one hour of downtime. Now, whenever I hear something like that, I think pg_upgrade. However, they're using Azure's hosted Postgres solution, and I guess you can't use pg_upgrade there, because they didn't mention it in this post. I've converted databases even larger than this using pg_upgrade, and it takes less than a minute, two minutes max, to do the actual conversion. There's preparation to do as part of it, but the upgrade itself usually takes less than a minute.

Instead, they decided on a logical replication solution: get another database on the new version of Postgres in logical replication sync with their primary, and then transition to it. They tried Azure's Database Migration Service, but it caused them a number of issues, to the point that some long-running transactions blocked Postgres's autovacuum and eventually caused them some downtime. That wasn't during the upgrade itself, but in the preparation leading up to it, which is unfortunate. What they ultimately decided to use is something called Warp. They also considered pglogical, but it doesn't work on the platform they were on, Azure's single server Postgres product. And some sort of extension really is needed here, because native logical replication didn't exist prior to version 10. So they chose Warp, but even that looked like it had some issues.

Now, they seemed very concerned about the cutover time, which I'm a little confused by, because my understanding of a logical replication upgrade is that you keep the databases in sync, take your time to make sure everything is exactly the way you want it, and then the transition itself is relatively fast. But they talked about skipping some tables in Warp and manually converting two massive tables. One was a logs table, but the other one seemed pretty important, so I'm not quite sure why you would leave that behind and convert it separately; it wasn't clear to me from the blog post. They did do a fair amount of testing in staging and some practice run-throughs, which is definitely good to do for a clean conversion. And then another issue: they had to remove foreign key constraints to do the conversion.
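For reference, this is a minimal sketch of what the native logical replication approach looks like on Postgres 10 and later (not available to Retool on 9.6, which is why they needed an external tool); the object names and connection string values are placeholders:

-- On the old (source) cluster: publish every table.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (target) cluster, after creating the schema there:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary.example.com dbname=app user=replicator'
    PUBLICATION upgrade_pub;

-- Once the tables are caught up and you have cut over:
DROP SUBSCRIPTION upgrade_sub;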
So some of this was really a bit inconvenient, which is why I definitely like using pg_upgrade. But if you want to learn more about their transition, maybe because you're on Azure and thinking about how to migrate to a more recent version, this is definitely a great post for reviewing the challenges and successes they had.

[00:07:16] The next piece of content is "Our experience with PostgreSQL on ZFS". This is from lackofimagination.org. Typically, if you're using Postgres on Linux (I tend to use Ubuntu), it uses the ext4 file system, but ZFS has continued to mature, and I actually have some clients using ZFS for their Postgres instances, primarily for the compression. They mention their reasons for using ZFS. The first is being able to do file system snapshots instantaneously, and they're actually using those snapshots as their backups, taken hourly. It is, of course, better to use a point-in-time recovery mechanism that keeps track of all the WAL, because with that you can restore to any point in time, whereas with hourly snapshots you can only restore to the hour, but they found this advantageous. The second reason, as I mentioned, is the built-in compression. In their case they got a 4.5x compression ratio, which is pretty large, so they're saving a lot of space in their database, and it can actually result in fewer disk accesses, in exchange for some increased CPU usage to handle the compression. The third reason they mention is built-in support for secondary caching on SSD disks. They're using cloud instances, and some cloud instances, like on AWS, have disks attached to the instance itself; they're using those instance volumes as a read cache, which gives them better performance by serving reads from there rather than going through the storage system. It has other nice features they like as well, such as being able to detect bit rot, and essentially its own implementation of software RAID.

Then they go into their backups, which are basically the snapshots: they keep 23 hourly and 14 daily snapshots. They also give you all the commands to set up ZFS on an Ubuntu machine. They show creating the main ZFS pool with the two drives they're using in RAID 0, adding the local SSD disks as the cache, creating their volumes, setting up compression and a few other parameters, and then installing Postgres. The other thing they did is turn off full page writes in Postgres when using ZFS, because, as they say, quote, "by design it's impossible to write partial pages on ZFS" (a minimal settings sketch for that follows below). They also give you the backup script they use, which runs from a separate machine that accesses the primary over SSH to do the backup, although I would probably just run the script on the local machine to avoid network issues, and there's a comment about some of this below the post as well. But this is a great post that shows you how to get Postgres set up on ZFS.

[00:10:19] The next piece of content is "Steady Storage Stampede". This is from enterprisedb.com, and they're talking about a scenario where, say, you're moving your database to a cloud solution and you pick the defaults for the volumes you're going to be using.
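Circling back to the ZFS post for a moment, turning off full page writes is a one-line configuration change. A minimal sketch, assuming superuser access, and only appropriate on ZFS or another filesystem that guarantees atomic page writes:

-- Safe on ZFS because partial page writes cannot happen there.
ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();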
In the storage stampede post, they're using AWS gp2 storage with 1,500 IOPS and 3,000 IOPS of burst performance, and they ran a pgbench benchmark. They noticed that transactions per second started around 2,000, but then fell off a cliff and went up and down, very erratic, for the rest of the test, and they wondered what was going on. They looked at the Linux dirty buffer RAM and noticed it was increasing, essentially up to its limit, right when the throughput became erratic. And even once the test was stopped, it took a fair amount of time, 215 seconds, before the buffers were fully drained and stored on the disks. So the erratic throughput is bad, but taking that long to drain memory and get it onto the disks is bad as well.

So they said, all right, let's increase the size of the dirty buffers up to 16GB, using vm.dirty_bytes. Now you see really high TPS, which simply falls off when the test stops. But when you look at the Linux dirty buffers, you can see them filling up (they don't quite reach the 16GB that was set), and then they take forever to drain. So essentially you're just using up all of that RAM and it's not getting to the disks fast enough. Then they tried throttling the buffers instead, making them smaller at 1GB. Here the test starts off around 2,000 TPS, but the buffers quickly saturate at that 1GB limit; the performance isn't as erratic, though, and it stays around 1,000 until the test finishes. So this is an indication that the disks simply aren't fast enough.

All right then, let's throw more IOPS at the problem: they provisioned 10,000 IOPS to see what they would get. Here, where the peak was previously around 2,000, it's now in excess of 5,000 with all that disk performance, but it does eventually saturate the dirty buffers and fall back to around 4,000, and this is with 8GB of dirty buffers. Now, the thing to keep in mind is that this is a consistent, staged test, and some of this may be hard to reproduce if your application's activity is erratic. To me, the point is that when you see this kind of erratic behavior in your TPS, where things are doing fine and then suddenly they're not, take a look at what your dirty buffers are doing at the operating system level, because maybe you're saturating them. Then you can either decrease them to throttle writes, increase them a bit more, or, of course, increase your disk performance. Another way to check is just to look at the disk IOPS of your volume; at least in AWS, you can pretty easily tell if you're saturating it at some point. There's also a small monitoring sketch just below showing the write pressure visible from inside Postgres. But if you want to learn more, definitely check out this blog post.

The next piece of content is "The Vanishing Thread and PostgreSQL TCP connection parameters". This was a scenario where a Ruby on Rails application suddenly started showing 15-minute timeouts for certain threads. They were asking, what's going on? The timeouts we have set are five or ten or fifteen seconds, not 15 minutes. So they considered the different parts of their architecture: the application, which uses Puma threads, a network load balancer, PgBouncer, and the database.
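As referenced above, a small monitoring sketch. This is not what the EDB post measured (they watched the Linux dirty page cache and the volume's IOPS directly), but the cumulative background writer statistics at least show how much write work Postgres itself is handing to the operating system:

-- Cumulative write activity since the last stats reset (Postgres 14-era columns).
-- A high buffers_backend count means backends are writing pages themselves,
-- which often indicates the checkpointer and background writer can't keep up.
SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       buffers_clean,
       buffers_backend
FROM pg_stat_bgwriter;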
Back to the vanishing thread: the database has a statement timeout on the order of seconds, they're also using log_min_duration_statement, and nothing is getting logged anywhere near a 15-minute interval, so it doesn't look like anything is going on with the database. For PgBouncer it's a similar story: they have a query wait timeout, so either the query gets executed or the connection is cancelled, so it shouldn't be happening there. So it looks like something between the application and the network load balancer, or maybe between the load balancer and PgBouncer. When they looked at the traffic from the application to the load balancer, they noticed a lot of TCP retransmits, and those occurred just before these 15-minute timeouts. It turns out the Linux kernel by default will, quote, "retransmit unacknowledged packets up to 15 times with exponential backoff", so the default settings can lead to a delay of about 15 minutes. And on one of their background workers that doesn't have an application-level timeout, they saw an error saying "connection timed out", which basically comes from the operating system. They were able to replicate this behavior using tcpdump and iptables: they made a connection, then broke it, and got the same behavior when trying to send traffic to the database.

The way they fixed it was by modifying the TCP connection parameters: shortening the keepalive idle time, the keepalive interval and count, as well as setting a TCP user timeout. Again, this is application specific, so you wouldn't want to just copy these values into your application, but with these parameters, connections were cancelled appropriately instead of taking 15 minutes. The post, as you can tell, is very, very long and goes into a lot of detail about these settings and where they set them to get the behavior they were looking for. Now, this doesn't identify the exact root cause, but it at least mitigates the problem and allows these connections to be cancelled. The open questions that would finally resolve the issue are, first, why does the Ruby pg library sometimes ignore Ruby interrupts or timeouts while waiting at the TCP layer? In other words, if it can't get what it needs from the database, why does the connection just stay open instead of being cancelled? And second, what's the root cause of these spikes in retransmissions? Is there an issue with the network, or with the network load balancer? We don't have those answers right now, but this is a very interesting deep dive into using TCP connection parameters to work around what is, at this point, an unknown issue.

[00:16:59] The next piece of content is "PostgreSQL 14 B-tree index: reduced bloat with bottom-up deletion". This is from percona.com, and they're talking about preventing page splits by doing bottom-up deletion. They show an example comparing Postgres 13 and 14: they created a table, loaded some data into it, applied two indexes, and the sizes of the tables and indexes were pretty much identical. Then they did four updates across the entire table to churn a lot of data and checked the sizes again. Between Postgres 13 and 14, the tables were exactly the same size, but the indexes were larger on version 13 than on version 14.
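If you want to do a similar before-and-after comparison yourself, the sizes are easy to check; in this minimal sketch, 'my_table' is just a placeholder:

-- Compare heap size against total index size for a table.
SELECT pg_size_pretty(pg_table_size('my_table'))   AS heap_size,
       pg_size_pretty(pg_indexes_size('my_table')) AS index_size;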
So that's essentially bottom-up deletion in action. I kind of think of it as a just-in-time vacuum: it does some cleanup in the index pages ahead of vacuum coming through, basically trying to avoid those page splits from happening in the indexes. This is also the content that 5mins of Postgres covered this week, in episode 14, "HOT updates (heap only tuple updates) versus bottom-up index deletion in Postgres 14". Here, pganalyze.com goes through this post and a few others to get into more detail about how this works, so if you're interested, definitely check out that piece of content as well.

[00:18:24] The next piece of content is actually a YouTube video, "PostgreSQL partitioning tutorial", from the High Performance Programming YouTube channel. He explains that the reason people tend to use partitioning is that it can give some performance improvement, although the improvements aren't drastic. The more important reason, which he also mentions, is maintenance: when you have, say, a one terabyte table, vacuuming it is going to take a long time, particularly if it has a lot of indexes, whereas if it's partitioned, vacuum can operate on it much more easily. So maintenance is really the main reason I start moving to partitioned tables, because the performance, while better, is not blow-the-doors-off better in my experience. He explains the different types of partitioning, runs through the commands, and shows you how to partition a table (a minimal range partitioning sketch also appears below). So if you're interested in that type of content, definitely check out this video.

[00:19:22] The next piece of content is "Two-phase commits for logical replication publications and subscriptions". This is from postgresql.fastware.com, and they're talking about a new feature added to Postgres 15 where you can do a two-phase commit for your logical replication. Basically, when you set up your subscriber, you can choose to make it a two-phase commit subscriber. If you want to learn more about that, definitely check out this blog post.

[00:19:48] The next piece of content is "Database migration: SQLite to PostgreSQL". This is from bytebase.com. Now, this isn't someone moving a big application from SQLite to PostgreSQL; this is for an embedded solution. They chose to change their embedded SQLite database to PostgreSQL, and they describe how they did that. So if you're interested in that, you can check out this blog post.

[00:20:11] The next piece of content is actually a presentation, "Postgres in a Microservices World" by Bruce Momjian, and it's 111 slides worth of content. So if you're interested in that, you can definitely check out this presentation.

[00:20:26] The next piece of content: the PostgreSQL Person of the Week is Bharath Rupireddy. If you're interested in learning more about Bharath and his contributions to Postgres, definitely check out this blog post.

[00:20:37] And we had another episode of the Rubber Duck Dev Show this past Wednesday evening. This one was on where you should store your business logic. Basically, your application framework may have guidance on where to put things, but what's the best way to structure, particularly, a web application? So if you're interested in that type of content, we definitely welcome you to check out our show.
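As referenced above, a minimal sketch of declarative range partitioning; the table and column names are placeholders, not taken from the video:

-- A parent table partitioned by month on a timestamp column.
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2022_04 PARTITION OF events
    FOR VALUES FROM ('2022-04-01') TO ('2022-05-01');

CREATE TABLE events_2022_05 PARTITION OF events
    FOR VALUES FROM ('2022-05-01') TO ('2022-06-01');

-- Each partition can then be vacuumed, indexed, or dropped independently.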
[00:20:58] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 25

August 13, 2018 00:13:14
generate_series, PG11 Partitioning, GPU Queries | Scaling Postgres 25

In this episode of Scaling Postgres, we review articles covering graphing with generate_series, partitioning in Postgres 11 and GPUs for queries. To get the...

Episode 200

January 30, 2022 00:12:11
5 Minutes of Postgres, Unnest, SCRAM, Multi-master Review | Scaling Postgres 200

In this episode of Scaling Postgres, we discuss the new series 5 minutes of Postgres, using unnest, switching to SCRAM authentication and reviewing multi-master...

Episode 220

June 19, 2022 00:18:57
PG 14.4 Release, Subscription Skip, Using Schemas, Open Source Citus | Scaling Postgres 220

In this episode of Scaling Postgres, we discuss a PG 14.4 release, how to skip transactions with logical replication, how to use schemas and...
