Safety Systems, Failover Slots, Transaction ID Complications, Repartitioning | Scaling Postgres 118

Episode 118 June 14, 2020 00:15:51
Safety Systems, Failover Slots, Transaction ID Complications, Repartitioning | Scaling Postgres 118
Scaling Postgres
Safety Systems, Failover Slots, Transaction ID Complications, Repartitioning | Scaling Postgres 118

Jun 14 2020 | 00:15:51

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss safety systems, the purpose and existence of failover slots, complications with transaction IDs and how to repartition without downtime.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/118-safety-systems-failover-slots-transaction-id-complications-repartitioning/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about safety systems, failover slots, transaction ID, complications and repartitioning. I'm Kristen Jameson, and this is scaling postgres episode 118. [00:00:18] One. [00:00:23] All right, I hope you, your family million coworkers continue to do well in the times that we live in. Our first piece of content is safety systems can reduce safety. This is from Momgm us. And he typically does quick little blog posts that are a paragraph or two. This one's a little bit longer. But what he had to say about safety systems I found particularly interesting. He's basically mentioned three instances where actually a safety system caused a problem. So for example, caused a jet to crash, caused the Three Mile Island nuclear reactor meltdown, caused the Chernobyl nuclear meltdown. So basically the safety systems introduced to protect something caused a problem. And he's looking at this from the perspective of PostgreSQL and he mentions a few different safety systems here with regard to backups, error reporting, replication and connection pooling. And talking about backups, he looks at it through a few different lenses. How serious is the failure that the safety system is trying to prevent? High how likely is the failure that the safety system is trying to prevent? High how reliable is the safety system? About a medium. And then what impact will failure of the safety system have on the overall system? And it's relatively low. And he says, well, maybe you run out of disk space while the backup is occurring, for example. But usually you're not going to cause the system to go down by taking a backup. Maybe it has a little bit of a performance impact, but not very much. And what I find interesting about this and thinking about it is that what about other areas like he mentioned replication. So if you're going with streaming replication, the one that causes the least impact is asynchronous replication. But then once you turn on synchronous replication, a safety system to ensure you don't lose any data, well now you have essentially two servers that always need to be operating to keep synchronous replication going. Maybe you can add more, but each one you add, you're adding another point of failure. So this safety system, each time you add to it, you're introducing another potential source of failure because your database will stop working if the synchronous replication stops working. So essentially the safety system causes a problem to happen. And then he talks about connection pooling and that maybe you want a connection pooler. Well now that is a source of error. Well, maybe you want to introduce two connection poolers. Well now you've got more complexity you need to coordinate and deal with to make sure that those work. And the same thing happens with high availability solutions. So those manage the switch over from one database to another if there's an issue. But then what if there's a failure in that system? Could that actually cause downtime? So not a long post, but definitely something to consider as you add these new more safety systems, do they really create a more safe system? [00:03:16] The next piece of content is failover slots for PostgreSQL, and this is from secondquader.com. Now, I will mention that I thought I recalled an earlier version of this post, but this is mentioned as being June Eigth 2020. So I don't know if this has been updated or if this is a new post, but I have seen a number of other articles talking about failover slots. And this is the concept that a replication slot, when you fail over from a primary to a Replica, any replication slots do not get carried over, they stay with the old Replica. So you need to recreate them on the new Replica. And this can cause problems with things like they mentioned here, logical decoding, where essentially you lose the place that you were at and you potentially have to make adjustments to handle it. And they talk about the failover problem here with the fact that the slot does not get crossed over. And the issue is because you have to track not only the existence of the slot, but where it is at in the wall file, so that when failover occurs, the slot knows exactly where it needs to be and it talks about a patch with regard to it. But like I said, I don't know if this is an old post because I don't see any reference to a link to the patch or whether it's a patch that's potentially coming in postgres 13 or a future version 14. So I don't know the state of this, but even then it brings out an important thing to keep in mind if you're using logical replication or logical decoding with postgres, the fact that the slots don't get copied over and you need to handle that case. So if you happen to know the existence of a patch for postgres 13 or 14 that may address this, go ahead and let me know. [00:05:00] The next piece of content is PG friday ten Things Postgres Could Improve part one, and this is from secondquader.com as well. Now, interesting, they say Ten things postgres could improve, but I don't see a listing of ten things, but they're listing four things that they're going to address in four parts. So I was a little confused by that. But basically this addresses part one of their things that could be improved and it covers transaction ID complications. [00:05:33] So this is from a post that we covered in a previous episode of Scaling Postgres, talking about ten Things I Hate about postgres. So it's a little play on that. But they're focusing on the transaction ID complications. And this is the fact that a transaction ID is a 32 bit integer and that means there is a finite number of active transactions you can have. So you can run into problems if you're not actively vacuuming up your system. If you have long running transactions, you could run out of those active transactions essentially within tables you're working with, or for using things like prepared transactions. So there's a few different scenarios where you can run out of these transactions. So you really want to be aware of these possibilities. Now, how can you protect yourself for it? And really the number one way is to monitor it, to monitor the state of it across all the tables and the database to make sure that you're not going to run out of these transaction IDs. And the second thing of course, is to make sure your table is well auto vacuumed and that you're freezing these IDs so that they are available for reuse, as well as make sure you don't have a lot of long running transactions, especially idle in transaction sessions. And in terms of the feature, they are looking towards things like maybe making the transaction ID a 64 bit number just to increase the runway. Or alternatively, maybe a new storage system like Zheep could offer an alternative. So if you want to learn about transaction ID complications, go ahead and check out this blog post. [00:07:10] The next piece of content is Control runaway postgres queries with statement timeout. This is from Crunchydata.com and it's talking about Statement Timeout, which is a configuration option that enables you to timeout statements that exceed a certain duration. [00:07:25] Now, you can set this on the cluster by session. They are even talking about per user, per database. So this helps you avoid the kind of things that we were talking about in the previous post of having too many long running transactions within your database system. Now, one thing he says he likes to do is set at the database level. Set the statement timeout to 60 seconds so that no statement can exceed that duration within that database. And if you have to run a longer query, then in whatever session you're using, just do a set the statement timeout to a longer duration and then once you close out of that session, those settings will be forgotten. So if you want to learn more about the Statement Timeout configuration option, check out this blog post. [00:08:08] The next piece of content is repartitioning with logical replication in PostgreSQL 13. So this is from Agonder net and he's talking about a way to repartition a table from say, a daily partitioning scheme to a weekly partitioning scheme, although you could migrate it one way or the other. And he's looking to do this without downtime. So clearly you can remove access to the system and be able to do it that way. But how could you do it without downtime? Now he says, quote but first, a disclaimer. This is definitely not pretty, but apparently it does work as a quick hack. So he has one table that is partitioned daily and he inserted some data into it and then he set up a new table with a weekly partitioning scheme. Now, in order to do this, you actually need to work with two databases and create a publication from one table to another table in another database, and then from that database, create a publication from that table in the secondary database back to the first database. And he goes through all the different code here. But the description of all that he's doing and kind of how it works is listed out in this eight step process here. And it's pretty much what I described. You replicate from one table in database One to a second table in the second database, and then replicate from that to another table in database one, all using logical replication. And that should be able to repartition your data without encountering any downtime. So if you're interested in learning more about this technique, definitely encourage you to check out this blog post. [00:09:46] The next post is Multimaster Replication Solutions for PostgreSQL. This is from Procona.com, and they're basically describing different solutions that do multimaster replication. They cover some key concepts with regard to database replication, synchronous replication, asynchronous replication, a single master replication, which is essentially what Postgres does out of the box, only a single master solution. But then they go into multimaster. Talk about some of the pros, then of course, some of the cons with regard to it. I'll just mention some of the cons. The main disadvantage of it, of course, is complexity. Second, conflict resolution is very difficult because you can have simultaneous writes on more than one master at a time. So if someone updates the same data in the same table on two or three of the servers, which one wins? And then sometimes manual intervention is necessary in that case. And then they mentioned the different solutions that they're aware of. The first is BDR, which is the bi directional replication product of Second Quadrant. And they discuss that. They discuss XDB. They discuss PostgreSQL Excel. PostgreSQL XC and XC. Two Ruby rep and Bucardo. So if you're interested in looking into a multimaster database solution built on Postgres, definitely check out this blog post and the different solutions offered. [00:11:14] The next piece of content is composite type performance issues in PostgreSQL. This is from CyberTech postgresql.com. By composite types, they're talking about custom data types that you can create in Postgres. So for example, you can create a type. This one they've chosen to call Person, and it includes an ID, a name and an income. And that one type is encompassing all of these different types. And you can store this in a single column. And you can see here that the type of that column is person when you create it with that data type of person. Now, one thing they say here is be careful about database performance. Now, they didn't use this example. They actually moved to using the PG Stats Tuple because in terms of what it returns from when you call it, it returns a composite type, but based on how you query it. Like if you query it this way, it returns in almost 2 seconds. Whereas if you actually do a normal select with a from clause, it returns in 200 milliseconds. He says this is because when working with a composite type, it actually goes through and calls this function for each column, apparently. So it's just something interesting to be aware of is that depending on how you use composite types, you could run into a performance issue. So just be aware of this if you use them. Personally, I haven't used composite types in my database design. I tend to use non composite types. If I do use a composite type, I generally use JSON B fields, but go ahead and check out this blog post if you want to learn more. [00:12:49] The next piece of content is transactions in PostgreSQL and their mechanism. So this is talking about transactions, and it covers essentially normal transactions that are auto committed by default in postgres. And they talk about that and show different ways it gets executed. And they start digging deep into how this actually works, and they query the table to look at the different hidden fields and how that tracks and manages that transaction. They talk about subtransactions that can happen within a primary transaction. They talk about multitransactions that cover basically row level locks, and then finally they follow up with two phase commit transactions and how they work. So if you want to deep dive into the technical details of how transactions work in postgres, definitely check out this blog post from Higo, CA. [00:13:42] The next piece of content is a set of two posts developers Diary One and Developers Diary Two from Paul Ramsay at Clever Elephant CA. And he's talking about enhancements that they've been doing to postgres with regards to performance. So if you're wanting to learn more about the work he's been doing to improve performance of large object manipulation in PostGIS, definitely check out these two blog posts. [00:14:12] The next piece of content is using the Crunchy PostgreSQL operator with PostGIS. This is from Crunchydata.com, and they're talking about exactly what it says using the Crunchy Data Kubernetes operator to set up PostGIS in Kubernetes. So if you're interested in doing that, definitely check out this blog post. [00:14:32] The next piece of content is PG Bouncer 1.14 is released. The main feature they list here is that encrypted Scam Secrets in PG Bouncer can now be used for server side login. So storing plain text passwords in PG Bouncer is no longer necessary to be able to use Scram. So if you're interested in this new feature, definitely check out this new version. [00:14:55] The next piece of content is how to use Ada Boost machine learning model with two UDA PostgreSQL and Orange part Six. So this is the part six of the post from Secondquadrant.com that's been covering how to do the machine learning that they're mentioning here. So definitely check out this blog post to learn more. [00:15:14] And the last piece of content is the PostgreSQL person of the Week is Olexi Kluken. So if you're interested in learning more about Olexi and his contributions to PostgreSQL, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the Show Notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes. Thanks.

Other Episodes

Episode 151

February 07, 2021 00:18:10
Episode Cover

Cleaning Up, Function Performance, 11 Million IOPS, Change Data Capture | Scaling Postgres 151

In this episode of Scaling Postgres, we discuss cleaning up your database, function performance, 11 million IOPS and change data capture. To get the...

Listen

Episode 139

November 08, 2020 00:12:37
Episode Cover

Weekly News, Full-Text Search Performance, pg_cron, Bulk Data Loading | Scaling Postgres 139

In this episode of Scaling Postgres, we discuss Postgres weekly news, full-text search performance, enhancements to pg_cron and the best way to bulk load...

Listen

Episode 228

August 14, 2022 00:12:45
Episode Cover

New Postgres Releases, Privilege Escalation CVE, Chaos Testing, High Availability | Scaling Postgres 228

In this episode of Scaling Postgres, we discuss new Postgres releases, a new privilege escalation CVE, chaos testing a high availability kubernetes cluster as...

Listen