Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about pagination solutions, return modifications, insert deletions, and how to JSON. I'm Creston Jamison, and this is Scaling Postgres episode 248.
[00:00:24] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Pagination and the problem of the total result count". This is from Cybertec-postgresql.com. They're talking about a situation where a lot of programming libraries have the ability to do pagination, but nearly all of them ask for the total count of records in the full result set to calculate how many pages would potentially need to be rendered, and they use a specific technique, limit and offset, to fetch each page. But that results in some performance problems. So in this example here, they generated some fake data that they wanted to page through. They wanted to look at values between zero and ten, ordered by the created_at date, so they created an index on created_at and value. Now, normally for the best performance you would swap these out, but I guess if you want the earliest record to always show, it makes more sense to have the created_at column first. Like I mentioned, most of these libraries use offset and limit, and they show exactly how that works here: when you're looking at the first page, and then when you're looking at the 100th page, you'll see it offsets 4,950 records and limits to 50 for each page. Now, the big disadvantage of this is that as you get to later pages, the performance gets worse and worse, because it has to read all of those 5,000 records and then only show the 50 of interest. But it's pretty simple, and this is what most libraries tend to use. He mentions another option: using cursors. Now, a cursor generally takes place within a transaction, so that's not necessarily going to help you, particularly with a web application, but you can use WITH HOLD cursors, and that basically materializes the whole result set that you have. Now, the huge disadvantage of this is that it takes a long time to pull back all of that data, depending on how large it is, so rendering your first page could be pretty slow. Secondly, you also need to do a fair amount of management of the cursor and close it properly as well. And the third disadvantage is that this doesn't really work well when you're doing transaction-level pooling, because it depends on having a consistent session to manage the cursor and keep it present. So this has a fair number of disadvantages as well. And then the third option is keyset pagination, where basically you keep track of the last value of the page, or I should say the last row of the page that's rendered, and then you use that for subsequent queries when you want to go to the second page, the third page, et cetera. So it gives you a new starting point, and generally you're going to want to use, say, the created_at date and an ID, some sort of unique identifier, just in case something has the same date and time. And with that it does require a slight modification to the index, because you're going to want to have the created_at, the ID, and the value. Now, this has a lot of advantages. The only disadvantage is needing to modify the index a bit for this specific query, so this may be less appropriate for more dynamic data you're trying to pull. And then they have this great chart here that shows the comparisons when trying to render pages 1, 10, and 100, and how offset and limit went from one millisecond for the first page to 15 milliseconds for the 100th page. And if you have more pages, it's going to increase from there. Using the WITH HOLD cursor, the first page took 642 milliseconds.
So that's a huge amount of time to pull all of that data. But the 10th page was able to show in 0.34 milliseconds, much faster than even the first page with offset and limit, and the 100th page was 0.51 milliseconds. So it's very quick once you have all that data loaded, and then keyset pagination was pretty consistent: 1.4 milliseconds, rounded up, whether it was page 1, 10, or 100; it didn't matter. So usually if I'm going to do pagination, keyset pagination is my go-to, but in general I don't like using pagination at all if I can help it. I try to purposefully limit the amount of data that is returned, or generate a report in the background for it, at least for a web application. The next area they cover is: what's the problem with the total result count? Basically, you need the total result count to enumerate all of the different pages that you could potentially click through, but this is a huge resource hog, and as they say here, quote, "The query to calculate the total result set count is usually more expensive than all other queries together", and that's definitely the case that I've seen. So basically you don't want to run the whole result count, and his recommendation, which I agree with, is either don't show it and just offer next page and previous page rather than all the different pages, or do an estimate. And you can do that using an explain plan for what you're querying, using the database statistics to give a rough count of how many rows you have. But anyway, these are all different methods you can use to handle pagination with Postgres, and you can check out this blog post if you want to learn more.
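To make those approaches concrete, here's a rough sketch using made-up table, column, and literal values rather than the post's actual example:

```sql
-- Hypothetical table and index, just to illustrate the techniques discussed.
CREATE INDEX idx_items_keyset ON items (created_at, id);

-- Offset/limit pagination (page 100 at 50 rows per page):
-- simple, but Postgres reads and discards the first 4,950 rows every time.
SELECT * FROM items
ORDER BY created_at, id
OFFSET 4950 LIMIT 50;

-- Keyset pagination: remember the last row of the previous page
-- (its created_at and id) and start strictly after it.
SELECT * FROM items
WHERE (created_at, id) > (timestamptz '2022-12-01 10:15:00+00', 4817)
ORDER BY created_at, id
LIMIT 50;

-- Estimated total count from the planner's statistics instead of an exact count(*):
EXPLAIN (FORMAT JSON) SELECT * FROM items;
-- the first "Plan Rows" value in the JSON output is the planner's row estimate,
-- which you can read in application code or wrap in a small helper function.
```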
[00:05:25] Next piece of content: "Returning modified rows". This is from Sqlfordevs.com, and he's talking about potentially rethinking how you perform updates against the database. So in this case, he's doing the update that he wants done, but then returning all of those rows. You get the result set back from this update statement, with all of the columns for the rows that were touched by it. So basically you don't have to do a secondary select, or do a select first and then do an update. You can also do this with delete statements and even insert statements. So whatever rows are the result of these statements, you can return specific columns or all columns from that result set. So this was just a quick blog post explaining how to do that.
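To make that concrete, here's a small sketch of the pattern; the table and column names are invented for illustration, not taken from the post:

```sql
-- Update and get the affected rows back in one round trip.
UPDATE subscriptions
SET status = 'expired'
WHERE expires_at < now()
RETURNING *;                      -- every column of every row that was updated

-- Works for deletes too.
DELETE FROM sessions
WHERE last_seen < now() - interval '30 days'
RETURNING id, user_id;

-- And for inserts, e.g. to get generated keys or defaults back.
INSERT INTO audit_log (event, payload)
VALUES ('signup', '{"plan":"free"}')
RETURNING id, created_at;
```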
[00:06:08] Next piece of content: "Easy, alternative soft deletion: deleted record insert". This is from brandur.org, and we covered a post of his in a previous episode of Scaling Postgres, "Soft deletion probably isn't worth it". He said he's seen numerous problems from supporting soft deletion, where, say, you add a deleted_at column, and when you need to delete data you don't delete the row, but instead update the deleted_at column to something other than null, and that record is then considered soft deleted. When you query the table, you now need to say where deleted_at is null to make sure you're not getting those deleted rows. But this causes problems with foreign keys, and with potentially querying the data directly and getting those deleted records when it wasn't intended. That earlier post walks through all of that, and this is a follow-up that walks through the result of making these changes, where he's not doing soft deletion. Instead he created a dedicated deleted_record table, and if data needs to be deleted, it gets deleted, but then inserted into this deleted_record table, so you still have a record of what was deleted. It's just in its own dedicated table, so you don't have to worry about foreign key issues and you don't have to worry about querying the table and accidentally pulling deleted records. And he even made an enhancement here where you can use a simple function such that when you have a delete, it inserts into this deleted_record table and stores all the columns in a JSON field. And to use it, you just create a trigger that says after delete on a particular table, for each row execute the function. So I really like this solution, and I may actually consider it the next time I'm looking at implementing a soft delete solution, but definitely check out this blog post if you want to learn more. Next piece of content: "How to JSON in PostgreSQL". This is from dev.to, in Francesco's area, and this is a summary blog post of how to use JSON with Postgres; it's basically links to more comprehensive blog posts about how to use all the different JSON features in Postgres. So if you're interested in learning about that, definitely check out this blog post.
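Here's a minimal sketch of that kind of setup; the schema and table names are illustrative and not necessarily the post's exact code:

```sql
-- One table holds every deleted row from any source table, stored as JSON.
CREATE TABLE deleted_record (
    id          bigserial   PRIMARY KEY,
    table_name  text        NOT NULL,
    data        jsonb       NOT NULL,
    deleted_at  timestamptz NOT NULL DEFAULT now()
);

-- Generic trigger function: capture the OLD row as JSON on delete.
CREATE FUNCTION deleted_record_insert() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO deleted_record (table_name, data)
    VALUES (TG_TABLE_NAME, to_jsonb(OLD));
    RETURN OLD;
END;
$$;

-- Attach it to any table whose deletions you want to keep a record of
-- ("invoices" is just a made-up example).
CREATE TRIGGER track_deleted_record
    AFTER DELETE ON invoices
    FOR EACH ROW EXECUTE FUNCTION deleted_record_insert();
```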
[00:08:14] Next piece of content: Bruce Momjian did a new presentation, and this is the PDF of it, called "Beyond Joins and Indexes"; it can be found at momjian.us. This is a follow-on to his "Explaining the Postgres Query Optimizer" talk, and it covers everything not related to joins and indexes, basically all of these 42 topics here. So it's definitely quite comprehensive, and I think it's, yes, 69 pages. So if you want to learn more about Postgres query or statement optimization, definitely check out this piece of content. Next piece of content: "PostgreSQL largest commits". This is from peter.eisentraut.org, and he looked through the Git log for each version of Postgres to see what its largest commit was. As of the most recent release, 15, it was the MERGE command, which was definitely a huge endeavor. For Postgres 14, it was extended statistics on expressions. For Postgres 13, it was allowing partition-wise joins in more cases, which is great. For version 12, the biggest one was the partial implementation of the SQL/JSON path language. That's great. Version 11 was replacing the catalog format, which is mostly relevant for internal developers. Postgres 10 was the logical replication feature; that was definitely huge. And so far for 16, it's the initial version of the Meson-based build system, so definitely something more internally related for this one. But you can check out this blog post if you want to learn more.
[00:09:45] Next piece of content: "Supabase Wrappers, a Postgres foreign data wrapper framework written in Rust". This is from Supabase.com, and they're talking about a new framework they developed that reworks foreign data wrappers to work not only against databases, but even against third-party APIs. The first release supports querying Firebase and Stripe, so you can query the Stripe API from your database. So instead of getting a library for your language of choice and communicating with Stripe that way, you do it directly in the database, creating a foreign data wrapper that makes a direct call to the API. This is fascinating to me, and they have plans for writing wrappers for ClickHouse, BigQuery and Airtable. So if you want to learn more about this and how to implement it, definitely check out this blog post. Next piece of content: "Horizontally scaling PostgreSQL". This is from PG IO, and this is a comprehensive blog post on all the different ways you can horizontally scale Postgres. They talk about read replicas, where you can set up streaming replication and read from multiple replicas. They talk about logical replication and how that enables you to logically replicate individual tables, or even specific data, by setting filters on columns and rows. They talk about multi-master and how that's not really a thing for Postgres, although there are third-party tools that offer that option. Then they talk about synchronous replication: by default replication is asynchronous, but you can make replicas synchronous and ensure that once you commit to the primary, it's actually committed on some other databases as well. And they follow up with foreign data wrappers, and being able to communicate with other types of data sources from your database. So check out this blog post if you want to learn all the different ways you could do horizontal scaling.
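As a hedged sketch of the logical and synchronous replication pieces mentioned there (the table, publication, connection string, and standby name are all invented; row and column filters on publications require PostgreSQL 15 or newer):

```sql
-- On the publisher: replicate only some columns and only the 'eu' rows.
CREATE PUBLICATION eu_orders
    FOR TABLE orders (id, customer_id, total)   -- column list
    WHERE (region = 'eu');                       -- row filter

-- On the subscriber: pull that publication in.
CREATE SUBSCRIPTION eu_orders_sub
    CONNECTION 'host=primary.example.com dbname=shop user=repl'
    PUBLICATION eu_orders;

-- For synchronous replication, tell the primary to wait for a named standby
-- before acknowledging commits.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
SELECT pg_reload_conf();
```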
[00:11:36] Next piece of content: "Upgrading PostgreSQL Extensions". This is from Percona.com, and they're talking about how, when you have an extension, you can actually check its version and even upgrade it to a new version when one comes out, and they give you the commands and show you how to do it here.
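For reference, a minimal sketch of the kind of commands involved; pg_stat_statements and the version number are just placeholder examples, so check what your own installation reports:

```sql
-- What's installed, and at which version?
SELECT extname, extversion FROM pg_extension;

-- What newer versions are available on this server?
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE installed_version IS NOT NULL;

-- Upgrade to the packaged default version...
ALTER EXTENSION pg_stat_statements UPDATE;

-- ...or pin a specific target version.
ALTER EXTENSION pg_stat_statements UPDATE TO '1.10';
```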
[00:11:51] Next piece of content: "A PostgreSQL developer's perspective: six interesting patches from November's commitfest". This is from Timescale.com, and some of the patches they talk about are two security improvements: one is using regular expressions for usernames in pg_hba.conf, and the second is allowing particular users to vacuum and analyze tables. The next two cover transparent column encryption capabilities. The next one covers groundwork to move from 32-bit transaction IDs to 64-bit ones; again, this doesn't give us 64-bit transaction IDs yet, but it starts laying the groundwork for getting to that point. And the last one of note is doing logical replication of DDL: it already replicates inserts, updates, deletes and truncates, but this one would potentially also replicate table changes, so actual schema changes as well. So definitely check out this blog post if you want to learn more. Next piece of content: "Using pgbench to load data faster, and the random_normal function in PG 16". This is from Pganalyze.com, and this is their Five Minutes of Postgres episode. The post they discussed is "Generating lots of test data with Postgres, fast and faster". This is something we covered on last week's Scaling Postgres, but you can definitely watch his episode to get his insight into that blog post as well. He also covers the post "Waiting for PostgreSQL 16: Invent random_normal() to provide normally distributed random numbers", which is from Depesz.com, and you can see that when you're using this function you can actually create a normal distribution of data. This could be advantageous if you want to fabricate data using generate_series for testing; there's a quick sketch of that at the end of this segment. Next piece of content: "Rethinking buffer mapping for modern hardware architectures". This is from Orioledata.com, and OrioleDB is actually a new storage engine for PostgreSQL, so it tries to take a modern approach to handling data that results in a number of advantages, a lot of which are performance related. The first part is the buffer work, and that is what this blog post covers, rethinking buffer mapping for modern hardware architectures. It goes through the process of how they've redesigned it to achieve, in a lot of cases, four times greater throughput in a read-only scalability test. They've also reworked MVCC to use an undo log concept: so as opposed to marking a row for deletion in the table, which is what happens in Postgres now, this maintains an undo log. And the third piece is supporting copy-on-write checkpoints. So it's all about giving Postgres better performance. This is the first blog post that I've seen them do on their website, and you should know that this is alpha software at this point, it's not even in beta yet, but if you want to learn more about it, definitely check out this blog post. Next are two blog posts about the Advent of Code: "How to solve Advent of Code 2022 using Postgres", day eleven and day twelve. These are from Crunchydata.com, and you can check these blog posts out to see their solutions. Next piece of content: "Time zone transformation using location data and PostGIS". This is also from Crunchydata.com. They're talking about a scenario where they were receiving some data whose timestamp did not have a time zone, but they had geolocation data, so they used the geolocation data to look up which zone it should be in and then modified the timestamp from a UTC timestamp to one in the proper time zone. So if you want to learn how to do that, you can check out this blog post.
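As promised above for random_normal(): here's a rough sketch of fabricating normally distributed test data, assuming PostgreSQL 16 or newer and a made-up table name:

```sql
-- random_normal(mean, stddev) is new in PostgreSQL 16.
-- Generate a million rows of test data centered on 100 with a standard
-- deviation of 15; the table name "measurements" is just an example.
CREATE TABLE measurements AS
SELECT g AS id,
       random_normal(100, 15) AS reading
FROM generate_series(1, 1000000) AS g;
```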
Next piece of content: "Avoiding constraint violations while migrating Oracle to PostgreSQL: date data type". This is from Migops.com, so definitely something valuable to know if you're looking to migrate from Oracle to Postgres. The next two posts are also related to Oracle to Postgres migrations. This one is "Substring or Substr functionality differences between Oracle and PostgreSQL: what you need to know", and this is from databaserookies.wordpress.com. And the next post, also from them, is "Unleashing Boolean data type casting in PostgreSQL". So definitely three pieces of content to check out if you're thinking about an Oracle to Postgres migration. Next piece of content: there was another episode of Postgres FM this week. This one was on copying a database, so if you're looking for all the different ways to move your database from one cluster to another, definitely check out this piece of content. You can listen to the episode or watch it on YouTube. And the next piece of content: the PostgreSQL person of the week is Francesco Tisiot. If you're interested in learning more about Francesco and his contributions to Postgres, definitely check out this blog post. And the last piece of content: we did have another episode of The Rubber Duck Dev Show this past Thursday afternoon. We've actually changed the time on that, but this one was on adventures in note-taking. So this wasn't developer-heavy content, but it was definitely all about learning and how to retain what you learn, and we talked about different tools that you can use to help learn and retain the information that you need to. So if you're interested in that type of content, we definitely welcome you to check out our show.
[00:16:56] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube or iTunes. Thanks.