Easy Recursive CTE, Zheap Undo, High Availability, Loading Data | Scaling Postgres 196

Episode 196 | December 19, 2021 | 00:17:12


Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss easily writing a recursive CTE, the zheap undo capability, high availability considerations and fast ways to load data.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/196-easy-recursive-cte-zheap-undo-high-availability-loading-data/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about easy recursive CTEs, zheap undo, high availability and loading data. I'm Creston Jamison, and this is Scaling Postgres episode 196.

[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Since it is the holiday season, I'm going to be taking about a week and a half break, so the next episode of Scaling Postgres will be happening in early January. I hope all of you have a great holiday season.

Our first piece of content is "Learn how to write SQL recursive CTE in five steps". This is from dev.to in the Yugabyte section. What I thought was interesting about this post is that it really breaks down, in five steps, how to write a recursive CTE, because a lot of times someone just gives you the result of a recursive CTE and not how it's built out, whereas that's exactly what this does. Step zero is basically to create an employee table, and a recursive CTE is just something that references itself. The example here shows people who are managed by other people: they have a table of employees, and each employee has a manager ID that points to another employee record. That's how the self-reference is set up, and he even gives an example of a foreign key reference for it. The first step is to find the root, or basically what is level zero, which is the row where the manager is null. That's essentially the president; no one else is managing him. And this is a very simple query that gets you exactly that information. Now, you'll see references to Yugabyte here, but of course this works with Postgres as well. The next step is to define this level zero in the WITH clause, so he just rewrites this very simple query using a WITH RECURSIVE statement. It still only does one step, but it gives you the exact same information you were seeing above. Step three is to define the join to the next level, so he joins the one row he received back to the employee table to pull out who that top person is managing. Now you can see the three records the top person manages, just by adding this join within this section. Step four is to concatenate the two levels with a union. So the first query showed this record, the next one showed who is managed by him, and he did a UNION ALL to put those together, which essentially gives you the first two levels. But to make the leap to doing it recursively, you need to put that UNION ALL inside the recursive CTE itself. That's what the final step does: he moves the union and the queries up into one recursive WITH clause, and that's how you set up a recursive CTE. I really like the way he laid it out, and if you're interested in finding out more, you can check out this blog post.
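To make the five steps concrete, here is a minimal sketch of the kind of recursive CTE the post builds up to. The table and column names (employees, id, name, manager_id) are illustrative assumptions, not necessarily the article's exact schema.

    -- Hypothetical employees table: each row may point to a manager row.
    CREATE TABLE employees (
        id         int PRIMARY KEY,
        name       text NOT NULL,
        manager_id int REFERENCES employees (id)  -- null for the top-level person
    );

    INSERT INTO employees VALUES
        (1, 'President', NULL),
        (2, 'VP Sales', 1),
        (3, 'VP Engineering', 1),
        (4, 'Engineer', 3);

    -- Start at the root (manager_id IS NULL), then repeatedly join back to
    -- employees to pull in each next level of reports.
    WITH RECURSIVE org AS (
        SELECT id, name, manager_id, 0 AS level
        FROM employees
        WHERE manager_id IS NULL            -- level zero: the president
      UNION ALL
        SELECT e.id, e.name, e.manager_id, org.level + 1
        FROM employees e
        JOIN org ON e.manager_id = org.id   -- people managed by the previous level
    )
    SELECT * FROM org ORDER BY level, id;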
[00:03:10] The next piece of content is "zheap: undo logs discarding in PostgreSQL". This is from cybertec-postgresql.com, and it's talking about the zheap project. Essentially, it creates another information store that's separate from the standard heap, where table history is recorded in an undo log as opposed to being stored directly in the heap as row versions. The benefit of this is that you don't have to use vacuum to handle that; a separate process takes care of those undo records, and they're stored in a separate place, much like Oracle does it.

So the benefit is avoiding table bloat, and this particularly matters when a table gets tons of updates, because in the standard storage engine of Postgres you create a new row version for each update that happens, so you can easily get a bloated table that way. With zheap table storage, the update is done in place, but it creates undo records to reference all of those versions. This post is essentially an update on the project. He covers how to get started with zheap, and they provide a container for it. Unfortunately, as they say, zheap is too invasive to be done as an extension, but my assumption is that this will eventually become a core part of PostgreSQL once it's been vetted and looks to be working. Right now it's still a technical preview, but once it's working as expected, I assume it would just be part of core and not an extension; it would simply be another storage engine you could use. Once you get that Docker container set up and working, you enable it. He did it for his entire session by setting the default table access method to zheap, although you can do it at table creation time instead if you want. So it's basically just a different way to store and access table data.

Then he tested zheap undo with an insert load. He inserted what looks to be 10 million records, and as he shows here, in the base undo directory there's a large number of files that get generated by all of these inserts, because zheap stores the undo in this separate location; it's not part of the table. So until the insert is actually committed, you're going to have all of these files around, and he shows that here. Once it's committed and you take a look at the directory again, it has far fewer files. Part of what this post describes is how this removal of the undo happens. It actually uses a separate worker that's been set up for zheap, called the undo discard worker, which is responsible for removing no-longer-needed undo once all changes have been committed and no other connection needs to see that information, because it still needs to comply with MVCC concurrency control. Then he tests zheap undo with an update load, essentially updating that table 10 million times, and as you can tell, there's a ton of WAL that gets generated, so it's still a write-heavy process. Again, the undo lands in the base undo directory, but once the transaction is committed, the directory shrinks back down in size. The important thing he mentions here is, quote, "the cleanup is not part of a commit, but it's actually done by the discard worker, which is in charge of making sure that the log is not killed too early". So again, it has to ensure that there are no other transactions running that need access to that undo information before removing it. This is another update on the zheap project, something I like keeping track of because I think it could prove beneficial in some table use cases with PostgreSQL to hopefully avoid bloat. If you're interested in learning more, you can check out this blog post.
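As a rough sketch of the workflow described here, and assuming a zheap-enabled build (such as the provided container) exposing a table access method named zheap, the standard access method syntax would look something like this; the table name and row count are illustrative.

    -- Use zheap for every table created in this session, as the author did,
    -- assuming the zheap storage engine is available in this build.
    SET default_table_access_method = 'zheap';

    -- Or choose the access method per table at creation time.
    CREATE TABLE zheap_test (
        id      bigint,
        payload text
    ) USING zheap;

    -- An insert-heavy load similar to the post's test; until this commits,
    -- the undo files it generates accumulate in the undo directory.
    INSERT INTO zheap_test
    SELECT g, 'row ' || g
    FROM generate_series(1, 10000000) AS g;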
The next piece of content is "Postgres HA: roles are dynamic". This is from tapoueh.org, and he's talking about Postgres high availability, specifically how the different nodes in a high availability setup must be dynamic.

Typically with Postgres, you set up a primary, and it's going to stay a primary until you essentially decommission it. Then you set up a replica, and it will continue to be a replica until you promote it. That's pretty much how it is. But with a high availability setup, you actually want the primary role to float back and forth between different machines as necessary. Maybe you want to upgrade a particular machine, so you want to change which server is the primary, and potentially you may want to flip back to it too. So this post makes the point that in high availability, the roles are dynamic: which node is the primary and which are secondaries needs to change dynamically, and essentially you need a tool to handle that, because Postgres by default really doesn't offer that capability. You can make an old primary catch up as a replica using pg_rewind, and you can of course promote a replica to a primary with a simple command, but being able to change roles back and forth requires additional tooling. Now, this post is by the author of pg_auto_failover, so he definitely mentions it more than once as a method for doing HA, and that's something you can check out. But the post also covers a lot of issues related to dynamically changing these roles and things you need to keep in mind, like capacity planning: you need to make sure that your replicas are essentially identical to the primary, or can at least handle its load. He also discusses things that impact long-term maintenance, specifically what you choose to use your replicas for. A lot of times replicas are sitting there with nothing going on, so people want to use them as a read-only source and start sending traffic to them. But then what happens on a failover? Can that replica handle both the primary's traffic and the read-only traffic it's already receiving, or do you need additional instances to handle that? So there are a lot of different things to consider when you're setting up your high availability solution. Then he talks about the application side of things, handling connection pooling, as well as different disaster recovery scenarios. It's definitely a long post that covers a lot of issues with regard to high availability, so if you're interested in that, definitely check out this blog post.

[00:09:33] The next piece of content is "The fastest way to load data into Postgres with Ruby on Rails". This is from pganalyze.com, and the post describes exactly what the title says. He's got a convenient table up here that shows you the different timings: if you insert one record at a time, inserting a million records will take you 1.3 hours, which seems really long, whereas a bulk insert with activerecord-import happens in 5.1 minutes. So what is this? Basically, an insert statement can contain just one row of information to insert, which is what inserting one record at a time means. But an insert statement can also carry many rows, so you could have a single insert statement with, say, 100 rows of data. That's essentially a bulk insert, and specifically he's using a separate library called activerecord-import to do it, although you don't need to; you can send a multi-row insert statement to Postgres without a problem from different languages.
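As a rough illustration of the difference, and of the COPY approach he covers next, here is a sketch using a made-up table; the table, columns and file path are assumptions for the example.

    CREATE TABLE records (
        name  text,
        value int
    );

    -- One row per statement: a million rows means a million statements.
    INSERT INTO records (name, value) VALUES ('a', 1);

    -- Multi-row (bulk) insert: one statement carries many rows, which is the
    -- kind of SQL a library like activerecord-import generates.
    INSERT INTO records (name, value) VALUES
        ('a', 1),
        ('b', 2),
        ('c', 3);

    -- COPY loads rows from a file (or STDIN) and is typically the fastest option.
    COPY records (name, value) FROM '/tmp/records.csv' WITH (FORMAT csv);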
The next approach he talks about is PostgreSQL COPY, and here he again uses a separate library, called activerecord-copy. In my experience, COPY is the fastest way to get data into Postgres, and you don't necessarily need a separate library, although I imagine this one has some convenience features. If I'm receiving data, I ideally want it as a CSV file, and then I use the Postgres COPY command to load it quickly. Now, his 1.5 minutes here seems super slow to me, because I've done a separate tutorial video where I inserted 100,000 records in just over a second. So something else is going on: either this library is causing some slowdown, or perhaps there are indexes present on the table, because of course the fastest way to insert data is with no indexes, triggers or things of that nature on the table. If you want maximum speed, you want to drop any triggers or indexes before you do the load, if you can; a lot of times you won't be able to, but that's definitely the fastest way to insert data. Then he talks about using background jobs, where basically you kick the load off to a background process so that, to the end user, it looks like it takes less than a second. But in summary, when you're inserting data, doing it a row at a time is the slowest, a multi-row insert is intermediate, and the Postgres COPY command gets you there the fastest; and if you want to go as fast as possible, drop indexes and triggers first. If you want to learn more about this, you can check out this blog post.

[00:12:15] The next piece of content is "Explore the new SEARCH and CYCLE features in PostgreSQL 14". This is from aiven.io, and he's talking about new features related to recursive queries. The post first sets up a database with tables of different locations; it's basically a traveling problem, asking what the different routes are to get to different locations. So he set up a trips table with all the different routes between the cities and their cost in euros. Then he wrote a query to show where you can go from Rome, and this is an example of what it looks like. Then, as he says, he adds more hops to the journey: using a recursive CTE from a previous post, he set the budget at €800 to see how far they can travel, which comes up with about 89 rows, and the results look similar to this. But then it gets into discussing SEARCH. That's the new Postgres 14 feature where, as he says, you can define your exploration path, and SEARCH lets you choose either a breadth-first or a depth-first option. The breadth-first option shows you the shortest routes first, working up to the longer routes, whereas the depth-first option shows you how deep you can go on a particular trip. In addition to that, he shows the CYCLE option, which lets you avoid loops, so you don't keep going back to the same city; it just shows a single loop to each of the cities. So this was an interesting post exploring the new SEARCH and CYCLE capabilities with recursive CTEs in Postgres 14, and if you want to learn more, check out this blog post.
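Here is a minimal sketch of the PostgreSQL 14 SEARCH and CYCLE syntax being discussed, using a made-up trips table and data rather than the article's exact schema.

    CREATE TABLE trips (
        origin      text,
        destination text,
        price       numeric   -- cost of the hop in euros
    );

    INSERT INTO trips VALUES
        ('Rome', 'Florence', 50),
        ('Florence', 'Venice', 100),
        ('Venice', 'Rome', 120);

    -- Breadth-first ordering plus cycle detection on a recursive route search.
    WITH RECURSIVE routes AS (
        SELECT origin, destination, price
        FROM trips
        WHERE origin = 'Rome'
      UNION ALL
        SELECT r.origin, t.destination, r.price + t.price
        FROM routes r
        JOIN trips t ON t.origin = r.destination
        WHERE r.price + t.price <= 800           -- stay within the budget
    )
    SEARCH BREADTH FIRST BY destination SET ordercol  -- or: SEARCH DEPTH FIRST BY ...
    CYCLE destination SET is_cycle USING visited      -- mark rows that revisit a city
    SELECT origin, destination, price
    FROM routes
    WHERE NOT is_cycle
    ORDER BY ordercol;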
[00:13:57] The next piece of content is "PostGIS Nearest Neighbor Syntax". This is a way to get only one distance calculation and one target literal from a nearest neighbor search using PostGIS. So if you're interested in that, you can check out this blog post.

The next piece of content is "UK dashboard built using Postgres and Citus for millions of users". This is from citusdata.com, and it's with regard to the ordeal that's been happening over the last couple of years; I'm not going to mention the name of it. I don't normally cover posts like this, since it's essentially a case study for Citus. However, this one had some pretty interesting details about the data, the volumes, how they set things up, dealing with partitioning of the data, and some issues they were encountering with processing it, so I found it a pretty interesting read. Just a warning: it is quite long. But if you want to learn more about this project, you can definitely check out this blog post.

[00:14:58] The next piece of content is "Manage encryption keys with PostgreSQL TDE". This is from cybertec-postgresql.com, and they have an open source tool they call PostgreSQL TDE, which does on-disk encryption for Postgres; it's provided separately for you to use. The post specifically talks about how the encryption keys are handled: basically, on startup there's a flag called encryption key command, and you provide a command that can retrieve the key from whatever key source you want to use for encrypting your database data. So if you want to learn more about that, you can check out this blog post.

The next piece of content is "High Availability and Disaster Recovery Recipes for PostgreSQL on Kubernetes". This is from percona.com. It talks about different ways of using their Postgres operator for doing high availability: a multi-data-center setup with multiple availability zones, how to handle vertical scaling, horizontal scaling, disaster recovery, backups and restores, as well as continuous restores, which are kind of like log shipping to a separate region, for example. You can check out this blog post if you want to learn more.

The next piece of content: the PostgreSQL person of the week is Ashutosh Bapat. So if you're interested in learning more about Ashutosh and his contributions to Postgres, definitely check out this blog post.

And the last piece of content: we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. In this episode we talked about code quality analyzers, more specifically RubyCritic for Ruby, but of course there are other tools that can analyze your code quality. So if you're interested in some long-form, developer-based content, you can definitely check out our show.

[00:16:48] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks. Bye.

Other Episodes

Episode 299

January 21, 2024 00:12:21

Build A GPT In SQL | Scaling Postgres 299

In this episode of Scaling Postgres, we discuss how you can build a GPT in 500 lines of SQL code, how to optimize extension...


Episode 86

October 20, 2019 00:15:34

Prewarming, Nondeterministic Collations, Generated Column Performance, Foreign Keys | Scaling Postgres 86

In this episode of Scaling Postgres, we discuss prewarming your cache, working with nondeterministic collations, generated column performance and foreign keys with partitions. To...


Episode 88

November 04, 2019 00:14:34

Partitioning, Logical Replication Upgrade, Columnar Compression, HAProxy Connections | Scaling Postgres 88

In this episode of Scaling Postgres, we discuss partitioning, logical replication upgrades, columnar compression and HAProxy connections. To get the show notes as well...
