Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about an outage from running out of transaction IDs, breaking at scale, fsync gate, and pg_stat_statements. I'm Creston Jamison and this is Scaling Postgres, episode 50.
[00:00:22] Alright, welcome to the 50th episode. And another milestone we recently crossed is over 700 subscribers on YouTube. So that's pretty good. It's a vanity metric, but interesting nonetheless. Now, there's a lot of content to go through, so I'm going to have to move pretty quickly through each piece.
The first article is actually from MailChimp, and they had an outage on Sunday, February 3. One of their physical Postgres instances saw a significant spike in writes, and that spike triggered a transaction ID wraparound issue. So this is definitely a public service message: as you scale, you need to monitor your transaction IDs to make sure they don't wrap around. Now, they said it was due to a spike in writes, so that could be from inserts or updates. If there was a large job churning through a lot of data doing updates, or even deletes, that could have caused this type of transaction ID wraparound issue. I believe this was part of an email sent to customers with important information about an ongoing Mandrill outage, because Mandrill is a service of MailChimp. And there's a second email update on the Mandrill outage, announcing service restored, that talks a little bit more about it. There may be a more in-depth analysis forthcoming, but these are two links to get some information about it, as well as a Twitter status thread that covered it in a bit more depth. So definitely, as you're scaling, transaction IDs are a key thing to keep track of so they don't bring your database down.
Another article related to this is from thebuild.com, titled "Do Not Change autovacuum Age Settings". It notes that PostgreSQL has two autovacuum age-related settings, autovacuum_freeze_max_age and vacuum_freeze_table_age, and that by default vacuum_freeze_table_age is set at 100 million and autovacuum_freeze_max_age at 200 million. He says the recommendation he used to make was to increase these to prevent some autovacuums from kicking off too frequently, and I've seen recommendations up to a billion, sometimes a billion and a half, for things like autovacuum_freeze_max_age. If you did that, he suggested doing manual vacuum freeze operations on the oldest tables during low-traffic periods. But the problem, he says, was that people weren't doing this, and with some enhancements in 9.6 and on, the I/O penalty is not as severe, so now he suggests just leaving it at 200 million. Remember, 2 billion is a limit you don't want to hit: if you raise the setting to a billion, you've only got a billion left to play with, and if you follow the recommendations that push it up to 1.5 billion, you really need to keep close watch and have monitoring in place, particularly if you have a large table with a lot of inserts, updates, and deletes, or a job gone awry. Something like that may have happened with Mandrill, because the issue happened on Sunday. So did something happen on Friday or Saturday that led to an issue on Sunday? We don't know right now, but his recommendation is to leave these settings at their defaults, or at least not push them as high as previously recommended.
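Just to make that monitoring concrete, here is a minimal sketch using the system catalogs; these queries are mine rather than from the articles, and what threshold you alert on is up to you:

    -- a minimal monitoring sketch: how old the oldest unfrozen transaction ID
    -- is in each database; you never want this approaching 2 billion
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- and the freeze-related settings discussed above, at their current values
    SHOW autovacuum_freeze_max_age;
    SHOW vacuum_freeze_table_age;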
The next article is "How Long Will a 64-bit Transaction ID Last in PostgreSQL?". Right now transaction IDs are 32-bit, and that's part of the reason the limit is 2 billion: the transaction ID space is effectively split in half, so roughly 2 billion transactions in the past can remain visible, and you can't run beyond that. But if they were made 64 bits, and I've seen this is potentially something being considered, that would significantly increase the number of IDs available. The comparison they did in this blog post, if you want to check it out, is that if you're doing 1 million write transactions every second, you would hit transaction ID wraparound in about 292,000 years. So if they eventually did this, it would be a solution to avoid this problem in the future.
Related to that, the PostgreSQL documentation on routine vacuuming has a section called "Preventing Transaction ID Wraparound Failures", and it includes a few SQL statements, much like the monitoring query above, that tell you how to track the oldest frozen transaction ID. Basically, you don't want these ages to hit 2 billion. So this is a way to track it manually, or there is monitoring software that can do it for you; either way, it's something you need to stay on top of.
The next article is "Postgres Instances Open to Connections from the Internet", and this is from the PostgreSQL notes blog at verite.pro. It starts from the observation that, with the rise of database-as-a-service, a lot of Postgres instances are going up that are left open to the internet, typically due to misconfiguration. So it has a set of recommendations to follow to make sure you're locking down your instance. For example, listen_addresses specifies which addresses Postgres listens on, and you should limit it to what you actually need. He also references a bug from 2013 where people could essentially bypass some of the configuration options set in pg_hba.conf and connect to the database without authentication, so you really want defense in depth. If you're trying to connect to your database across the Internet, the best practice is: don't. Set up your own isolated network at your hosting provider and have your app servers talk to your database server within that environment, without opening it up to the Internet at all. And if you're even going to consider connecting across it, use SSH tunnels, IPsec, or some other technique so that no one can just connect straight up to Postgres. They also reference the Scarlett Johansson cryptomining malware that was attacking PostgreSQL servers, and their last recommendation is forbidding non-SSL remote connections, basically only allowing SSL. So it's definitely a blog post to check out if you're doing any of these practices currently, because you want, of course, to be safe on the Internet.
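And just to make the lockdown advice concrete, here is a hedged sketch, not from the post itself, of checking and tightening a couple of these settings from SQL; the private address is made up, and note that listen_addresses only takes effect after a restart:

    -- see which interfaces Postgres is listening on ('*' means everything)
    SHOW listen_addresses;

    -- hypothetical example: listen only on localhost and one private address
    ALTER SYSTEM SET listen_addresses = 'localhost, 10.0.0.5';

    -- enable SSL support; actually rejecting non-SSL remote connections is
    -- done with hostssl lines (and no plain host lines) in pg_hba.conf
    ALTER SYSTEM SET ssl = on;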
The next piece of content, also from thebuild.com, is a presentation given at FOSDEM called "Breaking PostgreSQL at Scale", by Christophe Pettus. Despite the title, it's actually a set of recommendations on scaling, and what you would need to do at each stage of scaling PostgreSQL. He starts at 10 GB: what would you need to do in terms of scaling at that size? Then he moves up to 100 GB, then a terabyte, ten terabytes and beyond. At each of these stages he goes through his recommendations on what you would need to do to your database to be able to handle that much data, from considerations of how you handle backups, to indexes, and the different issues associated with each size. In terms of scaling Postgres, this is a great piece of content, so I highly suggest you check out this presentation. Now, they have started releasing videos from FOSDEM, so I may be posting the YouTube video of this next week. We'll have to see if they post it.
[00:08:18] The next piece of content is "The Most Useful Postgres Extension: pg_stat_statements". This is basically a walkthrough of how to set up pg_stat_statements, including creating the extension, and then how to use it to track which queries are running, with what frequency, and how much execution time they're taking in your database. So if you're not using it, it's definitely a blog post to check out. Now, the post didn't mention it, but one thing you have to consider when using pg_stat_statements, because it keeps aggregate counts, is how often to reset it. The function for that is pg_stat_statements_reset(). So you'll have to think about how often you want to reset it, whether monthly, quarterly, or on some other periodic basis, to essentially start from a fresh baseline for your statistics.
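As a minimal sketch of what that looks like, assuming pg_stat_statements is already in shared_preload_libraries (that part needs a postgresql.conf change and a restart), and noting these are the Postgres 11-era column names (total_time is split into separate columns in later versions):

    -- make the extension's view available in the current database
    CREATE EXTENSION pg_stat_statements;

    -- the top ten queries by total execution time
    SELECT query, calls, total_time, rows
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;

    -- reset the counters to start from a fresh baseline
    SELECT pg_stat_statements_reset();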
[00:09:09] The next post is "PostgreSQL Connection Pooling with PgBouncer", and this is from pgdash.io. It goes through how to set up PgBouncer on CentOS and Ubuntu, and how it acts as a connection pooler and a proxy for your database, letting you handle more client connections without using up a ton of connections in PostgreSQL. It's a pretty easy walkthrough, and it even gives some recommendations on where you would typically run PgBouncer and how to get started with it. So if you haven't gotten started with it, this is definitely a blog post to check out; I've also included a minimal configuration sketch at the end of this segment.
The next piece of content is a YouTube video from FOSDEM, "PostgreSQL Goes to Eleven". The first half, perhaps the first 20 minutes, talks about how committing works and how new features get added to Postgres, and the last half covers the new features that came in 11 and what's potentially coming in 12. So if that information is of interest to you, definitely check it out.
The next video, also from FOSDEM, is "PostgreSQL vs. fsync". This covers the fsync issue with PostgreSQL, where it relies on the kernel of the operating system it runs on to actually flush data in buffers to disk, and if there's ever an error doing that, the data can essentially get discarded rather than retried later. This is an issue they're working on, and I believe they said a solution is coming in PostgreSQL version 12. Now, this is a super rare issue, or at least one with a low probability of occurring, and it was covered in a previous episode of Scaling Postgres, but it's definitely something to keep an eye on, because you may want to upgrade sooner than you normally would if a feature that helps mitigate it lands in version 12.
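Here's the minimal PgBouncer configuration sketch I mentioned; it's not from the post, and every name and path in it is just an example, so adjust for your own setup:

    ; pgbouncer.ini -- a minimal sketch; "mydb" and the paths are hypothetical
    [databases]
    mydb = host=127.0.0.1 port=5432 dbname=mydb

    [pgbouncer]
    listen_addr = 127.0.0.1
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    default_pool_size = 20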
[00:11:13] The next post is "Using pg_repack to Rebuild PostgreSQL Database Objects Online".
[00:11:20] This is from the percona.com blog, and it's about using pg_repack. Depending on how you have Postgres configured, it generates a certain amount of bloat during normal usage, because rows are not updated or deleted in place: an update inserts a new row version and leaves the old one behind until vacuum cleans it up. And if vacuum isn't running frequently enough, that can cause bigger bloat problems. Now, resolving a lot of accumulated bloat normally requires a VACUUM FULL, but that locks the whole table for reads and writes. pg_repack, however, is a separate extension that lets you do an online rebuild of a table without holding that kind of lock. So if you're interested in using this extension, it's definitely a blog post to check out; there's a small usage sketch at the end of this segment.
The next post is "PostgreSQL with Passphrase-Protected SSL Keys under systemd". This is talking about how you can enable SSL with PostgreSQL, and a lot of the time people don't passphrase-protect the SSL key, because then there's no way to supply the passphrase when the service restarts and PostgreSQL can't start. So often the passphrase is left off so restarts just work. But if you do want to use passphrases, the post gives a few recommendations on setting that up using the ssl_passphrase_command setting. So if you're wanting to use passphrases with your SSL keys, definitely a blog post to check out.
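And here's the small pg_repack usage sketch I mentioned; it's not from the Percona post, and the table and database names are hypothetical:

    -- install the extension into the target database
    CREATE EXTENSION pg_repack;

    -- the rebuild itself is driven by the pg_repack client from a shell, e.g.:
    --   pg_repack --table=bloated_table mydb
    -- which rebuilds the table online, avoiding the long lock of VACUUM FULL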
[00:12:53] The next post is "Implementing 'AS OF' Queries in PostgreSQL". This is basically time travel: looking at the state of a table as it existed at a particular point in time. The post mentions that Oracle has this type of functionality built in and PostgreSQL does not, but this is a way you could do it. It does require setting up the table in a certain way: they demonstrate using timestamp ranges in an exclusion constraint, constructed so as to mimic this functionality. So if this is something you're potentially interested in, definitely a blog post to check out.
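To make the idea concrete, here is a minimal sketch along the same lines, not the post's exact schema, assuming the btree_gist extension is available and using made-up table and column names:

    -- btree_gist lets plain equality (id WITH =) appear in a GiST exclusion
    -- constraint alongside the range-overlap operator
    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE product_price (
        id    bigint    NOT NULL,
        price numeric   NOT NULL,
        valid tstzrange NOT NULL,
        -- at most one version of a row can be valid at any moment
        EXCLUDE USING gist (id WITH =, valid WITH &&)
    );

    -- an "AS OF" query: which price was in effect for item 42 at that instant?
    SELECT price
    FROM product_price
    WHERE id = 42
      AND valid @> '2019-02-01 00:00:00+00'::timestamptz;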
[00:13:35] And lastly, if you're interested in more PostgreSQL internals, there's "SortSupport: Sorting in Postgres at Speed" from brandur.org. It talks about the different sorting methods PostgreSQL uses and goes into a lot of detail about SortSupport specifically, which is used to accelerate sorting. So if you're interested in those kinds of internal details, this is definitely a blog post to check out.
[00:14:02] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.