Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about zheap status, sequence gaps, efficient pagination and setting up Patroni. I'm Creston Jamison and this is Scaling Postgres episode 175.
[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "PostgreSQL zheap: current status". This is from Cybertec postgresql.com, and this article is talking about zheap and its status. Now, zheap, as a reminder, is a different storage engine from Postgres' typical heap. What it does is support an undo operation, similar to how Oracle handles multi-version concurrency. The way Postgres works now is that when a row needs to be updated, it doesn't actually update that individual row in place. It creates a new row version and marks the old one for deletion, but it keeps that old version around for other sessions that still need it. In other words, that's its multi-version concurrency control. But the way Oracle works, which is similar to how zheap is intended to work with Postgres, is that it maintains an undo log. So the rows do get updated in place in the table, but the old values of the row get stored in the undo log, so that if you need to roll back or look at a previous version, it can consult that undo log. This post goes over zheap's current status and mentions that they're working on logical decoding, which is the process of reading the WAL and decoding the changes back out. This is for solutions that do change data capture, so zheap will have to support decoding for that feature to work. There is also an undo infrastructure that Postgres is working on that may make it into Postgres 15, and zheap wants to work with it, because the goal is for this type of undo infrastructure to work with more than just zheap. So it looks like a general undo infrastructure they are creating, and they're working on making zheap compatible with it. The other thing they're doing is a lot of bug fixing and merging, and they show all the code that's been developed so far for zheap here, as well as continuing to improve it. For example, one area they're working on now is the discard worker, which essentially goes through the undo chunks to throw away records that are no longer needed, similar, I guess, to how vacuum works. Now, they do have a build with zheap enabled that you can check out here, and you can create a table using zheap: you just add a clause when you create the table, and it will start using this storage engine for it, as sketched below. The advantage of zheap is that there should be much less overhead for tables that get heavily updated. With Postgres now and its standard heap, if you have a single row and you update it a million times, it's going to create a million row versions in that table, because each update is a new row version and the old one is kept. Then all of those need to be vacuumed up. But with zheap, the same row would be updated in place a million times. It would log those million changes in the undo log to be able to roll back or maintain different versions, but those undo records would eventually be deleted and you wouldn't bloat your main table. And in terms of zheap, they are asking for assistance. So if you want to help test it out or lend your assistance in any way, go ahead and reach out to Cybertec.
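Here is a minimal sketch of what that looks like, assuming a zheap-enabled build where the zheap table access method is available (the table and column names are made up):

-- Create a table on the zheap storage engine instead of the default heap.
CREATE TABLE highly_updated (
    id      bigint PRIMARY KEY,
    counter bigint NOT NULL
) USING zheap;

-- Updates rewrite the row in place; prior versions go to the undo log
-- rather than accumulating as dead tuples in the table.
UPDATE highly_updated SET counter = counter + 1 WHERE id = 1;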
The next piece of content is "One, two, skip a few". This is from incident.io, and they're talking about how they have incrementing incident numbers. Each customer has their own set of identifiers, so one customer has incident 1, incident 2, and another customer would also have their own incident 1, incident 2. So they're unique per customer; it's not a globally unique number. Now, to implement this they used sequences, but then they ran into a support problem: they were getting a lot of gaps. If you know anything about sequences, they are not guaranteed to be gapless; there are a number of situations where gaps get introduced. In this post they describe how they set up a trigger and used a sequence to hand out new numbers for each incident, and then they identified cases where a rollback can leave gaps in the sequence. So for example, here they insert two values, A and B. Then they insert C and D but roll back before those are committed. Then they insert E and F, and you can see that C and D are missing, but so are their sequence numbers. So that's one way you can get gaps. But that wasn't the case their end users were hitting. What they identified is that the gap they were seeing was exactly 32. Now, what's interesting about this is that this number is actually in the Postgres source code: it's the number of sequence values that are fetched and logged ahead each time. They quote the source code: we don't want to log each fetching of a value from a sequence, so we pre-log a few fetches in advance, namely 32, and in the event of a crash we can lose or skip over as many values as we pre-logged. So that is another case: if your Postgres crashes, you can get gaps in your sequences because it pre-logs 32 values at a time. But they hadn't had any crashes. So they looked into it some more, and what they discovered is that they had done an upgrade and switched over to a new system, and that is essentially what caused the gap they experienced. So basically, if you want this type of gapless incrementing number, don't use sequences for that purpose. Now, the way they ended up implementing it was taking the max of the identifier per customer and then using that to determine the identifier for the new row. There is of course overhead with this, because on every insert you basically need to do a select to work out which identifier to use, but it works in their case. A sketch of both behaviors is below. If you want to learn more, you can check out this post from incident.io.
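As a rough sketch of the two ideas above, using hypothetical table and column names and ignoring the concurrency handling a real implementation needs:

-- A bigserial id is backed by a sequence, so it can have gaps.
CREATE TABLE incidents (
    id              bigserial PRIMARY KEY,
    organisation_id bigint NOT NULL,
    external_id     bigint,          -- per-customer incident number
    name            text NOT NULL
);

-- Gap from a rollback: the sequence values consumed by C and D are not reused.
INSERT INTO incidents (organisation_id, name) VALUES (1, 'A'), (1, 'B');  -- ids 1, 2
BEGIN;
INSERT INTO incidents (organisation_id, name) VALUES (1, 'C'), (1, 'D');  -- ids 3, 4
ROLLBACK;
INSERT INTO incidents (organisation_id, name) VALUES (1, 'E'), (1, 'F');  -- ids 5, 6

-- Per-customer numbering in the spirit of their fix: select the current max
-- for that customer and add one at insert time (this needs locking or retries
-- under concurrent inserts).
INSERT INTO incidents (organisation_id, external_id, name)
SELECT 1, coalesce(max(external_id), 0) + 1, 'G'
FROM incidents
WHERE organisation_id = 1;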
The next piece of content is "Efficient pagination in Django and Postgres". This is from pganalyze.com, and they're talking about how, in web frameworks in general and in Django specifically, pagination tends to use LIMIT and OFFSET. It also tends to count all of the rows in the table, or whatever list you're paginating through, because it needs a full count to know how many pages there are. This implementation can get really slow as your number of records increases, and the further out you go in the offset, because the more records there are, the longer the count takes to run, and the deeper the offset, the more rows have to be scanned just to reach the page you want. They were seeing something early on taking 89 milliseconds, and then when they went out to an offset of 5 million in a 10 million record set they were seeing over 2 seconds, and over 4 seconds just to get a count of all the rows. So, very inefficient. Now, they did some adjustments, some hacking of the library as it were, to try to speed this up. They tried removing the count query, so basically it's open ended: you don't know how many total pages you're supposed to have, and of course that helped. They also tried approximating the count. You can read reltuples from pg_class to get an estimate, but that's not going to work if you're looking at a subset of the table. If you have a multi-tenant application where each customer uses a segment of the table, that solution won't really work. But the most efficient solution is keyset pagination. Basically you use some sort of indexed identifier, in this case probably the primary key ID: instead of an offset you pass the last ID you saw, and then you do a LIMIT of ten. The index that's in place can quickly find that record and just return the next rows after it, as in the sketch below. So this is the most efficient solution to this type of pagination problem. They talk a little more about the plugins and the adjustments they made, but keyset pagination is the way to go. So if you want to learn more about that, you can check out this blog post.
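A rough sketch of the difference, using a hypothetical items table:

-- Offset pagination: every skipped row still has to be scanned,
-- so deep pages get slower and slower.
SELECT * FROM items ORDER BY id LIMIT 10 OFFSET 5000000;

-- Keyset pagination: pass the last id seen on the previous page and let
-- the primary key index jump straight to the next rows.
SELECT * FROM items WHERE id > 5000000 ORDER BY id LIMIT 10;

-- Approximate count from planner statistics (whole table only, as noted above).
SELECT reltuples::bigint AS estimated_rows FROM pg_class WHERE relname = 'items';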
[00:08:31] The next piece of content is "How to set up PostgreSQL high availability with Patroni". This is from Arctype.com, and it walks through the process of using four nodes to set up a highly available PostgreSQL instance with Patroni. Basically, on one node they installed Patroni and Postgres. On a second node they also installed Patroni and Postgres, so these operate as the primary and one replica. Then on a third node they set up etcd. They also mention that you should of course have more than one etcd instance to help avoid split-brain scenarios, as well as an HAProxy instance to route traffic to the current primary. This blog post runs through all the commands to run on Ubuntu, from installing the prerequisites, Postgres, Patroni, etcd and HAProxy, through all the configuration you need to do to get it up and running. So it's a pretty comprehensive blog post, and if you're interested in setting up Patroni, I definitely encourage you to check it out.
[00:09:37] The next piece of content is "PostgreSQL extension catalogs". This is from fluca1978.github.io, and he's talking about three catalog views that are available in Postgres. They show you the extensions that are installed and the extensions that are available for you to install; the default install of Postgres, at least from the operating system package managers, generally has a lot of extensions available. But each extension can also come in different versions, and the view pg_available_extension_versions shows you all the different versions of those extensions that you can install. So I encourage you to go ahead and check out this blog post if you want to learn more about these views and how you can use them to help manage your extensions.
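For reference, a quick sketch of querying these catalogs (assuming the usual names pg_extension, pg_available_extensions and pg_available_extension_versions, with pg_stat_statements just as an example extension):

-- Extensions installed in the current database.
SELECT extname, extversion FROM pg_extension;

-- Extensions available to install, with default and installed versions.
SELECT name, default_version, installed_version FROM pg_available_extensions;

-- Every installable version of a particular extension.
SELECT name, version, installed
FROM pg_available_extension_versions
WHERE name = 'pg_stat_statements';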
[00:10:23] The next piece of content is "The amazing buffer tag in PostgreSQL". This is from Highgo.ca. They're talking about the buffer tag in Postgres, which allows Postgres to identify the specific file and block it needs to update. Basically, this buffer tag contains five numbers that identify the tablespace, the database, the table, the fork number and the block number. This blog post goes into how these work and explains how Postgres uses them under the covers to identify where data is stored. So if you're interested in that, you can check out this blog post.
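As an aside (not from the post itself), the pg_buffercache extension exposes essentially these same fields from SQL, which can be a handy way to see buffer tag components in practice:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- One row per shared buffer: tablespace, database, relation file,
-- fork number and block number, i.e. the components of the buffer tag.
SELECT reltablespace, reldatabase, relfilenode, relforknumber, relblocknumber
FROM pg_buffercache
LIMIT 5;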
[00:10:58] The next piece of content is "Maps with Django (part 2): GeoDjango, PostGIS and Leaflet". This post walks through setting up a mapping solution in Django, the web framework for Python. They use the GeoDjango library, PostGIS with Postgres, of course, to store the data, and Leaflet, which is a JavaScript library for interactive maps. Now, this is quite a comprehensive blog post that even has a very comprehensive index here, but basically at the end of it all you can get map markers rendered on a map. So if you want to learn how to implement this type of solution, definitely check out this blog post. The next piece of content is "Using PostGIS and pg_featureserv with QGIS". This is from Crunchydata.com, and they're talking about how to use pg_featureserv to output data using the OGC API for Features protocol to feed data into a GIS application. So if you're interested in doing that, you can check out this blog post.
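For flavor, an illustrative PostGIS snippet (a hypothetical table, not taken from either post) of the kind of point data these tools render as map markers or serve over the Features API:

CREATE EXTENSION IF NOT EXISTS postgis;

-- A simple table of named points in WGS 84 (SRID 4326).
CREATE TABLE markers (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    geom geometry(Point, 4326)
);

INSERT INTO markers (name, geom)
VALUES ('Sample marker', ST_SetSRID(ST_MakePoint(-0.1276, 51.5072), 4326));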
[00:12:01] The next piece of content, also from Crunchydata.com, is "Waiting for PostGIS 3.2: ST_MakeValid". This is about the ST_MakeValid function in PostGIS 3.2, which allows you to correct invalid geometries. So if you're interested in that, you can check out this blog post. The next piece of content, also from Crunchy Data, is "Waiting for PostGIS 3.2: ST_Contour and ST_SetZ". These are two additional functions that allow you to use raster data to generate contours and to look up raster values at points. So if you're interested in that, you can check out this blog post.
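As a small illustration of the first function (assuming a PostGIS install with ST_MakeValid available), here a self-intersecting "bowtie" polygon is repaired into a valid geometry:

-- Repair a self-intersecting polygon into a valid geometry.
SELECT ST_AsText(
    ST_MakeValid('POLYGON((0 0, 2 2, 0 2, 2 0, 0 0))'::geometry)
);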
[00:12:38] The next piece of content: the PostgreSQL Person of the Week is Rafia Sabih. So if you're interested in learning more about Rafia and her contributions to Postgres, you can check out this piece of content.
[00:12:51] And the last piece of content: we did another episode of The Rubber Duck Dev Show this past Wednesday. This one was on object-oriented versus functional programming. Our next live show will be Wednesday at 8:00 PM, where we will be covering exception handling, so I hope you will join us.
[00:13:09] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.