Slow Updates, Lateral Joins, Serial To Identity, PG 15 Changes | Scaling Postgres 238

Episode 238 October 24, 2022 00:13:05
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss why updates can be slow and what to do about them, use cases for lateral joins, moving from serial to identity for auto-incrementing IDs, and changes in Postgres 15.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/238-slow-updates-lateral-joins-serial-to-identity-pg15-changes/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about slow updates, lateral joins, serial to identity and Postgres 15 changes. I'm Creston Jamison and this is Scaling Postgres episode 238.

[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Why are my PostgreSQL updates getting slower?" This is from Cybertec-postgresql.com. He's talking about a situation where a job was running that was constantly doing updates, but each update kept getting slower, such that the final one was, I think, close to 100 times slower. Here's an example of what caused it. There was an item table that records items with some value, and then there's a summary table that only has one row in it: basically the sum total of the values in the item table. A trigger has been created so that whenever an item gets inserted, for each row it fires off a function that updates that sum. So if you insert a five, it adds five to the total in this one-row table, and if you insert a ten, it adds ten. Basically, as items are inserted into the item table, the running total in the sum table gets added up. Now, they're running this trigger as deferrable, initially deferred, to try and make it as performant as possible, and they inserted 100,000 random numbers using the COPY command.

So what is causing this? To help diagnose it, they set up auto_explain, which logs an explain plan for the queries you specify, and with this configuration they're basically logging all queries and getting information like buffers, etc. What they found is that the plans for the first update and the 100,000th update are pretty similar, but the actual time for the sequential scan is far longer for the last update than for the first. The other thing you can see is the number of shared buffers hit: one for the first update versus 443 for the last. So what's causing this to happen? It's basically Postgres's multiversion concurrency control (MVCC). Whenever you update a row, Postgres doesn't update that row in place; it actually creates a new row version, and those old versions have to be vacuumed up later. So you keep adding more row versions the more updates you do against the table. Then he asks: okay, vacuum is supposed to clean this up, so why is it not cleaning it up? Well, the problem is that this is all happening in one transaction. It's essentially one long-running transaction inserting 100,000 rows, so there's no opportunity for vacuum to clean anything up. The other thing is that they didn't create an index on the sum table, so the column being updated has no index on it. That should enable heap-only tuple (HOT) updates, meaning no indexes have to be updated, which should make things faster, and you should also get HOT pruning, which is kind of like a mini vacuum during the update. But not even that can work here, for the same reason vacuum can't: the transaction is still open. And he says that even adding indexes would actually make the problem worse, because then there's more information to keep up to date in maintaining the indexes too.
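To make the scenario concrete, here is a minimal sketch of the kind of schema and deferred per-row trigger described above. The table, function, and trigger names are hypothetical, not taken from the article:

```sql
-- Item table feeding a one-row summary table.
CREATE TABLE item (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    value integer NOT NULL
);

CREATE TABLE sums (
    total bigint NOT NULL
);
INSERT INTO sums (total) VALUES (0);

-- Each inserted item adds its value to the single summary row.
CREATE FUNCTION add_to_sum() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    UPDATE sums SET total = total + NEW.value;
    RETURN NEW;
END;
$$;

-- Deferred constraint trigger, as in the article: all of the resulting
-- updates fire inside one transaction, so neither vacuum nor HOT
-- pruning can clean up the dead versions of the summary row.
CREATE CONSTRAINT TRIGGER item_sum_trigger
    AFTER INSERT ON item
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE FUNCTION add_to_sum();

-- The article loads 100,000 random values with COPY; this is a
-- self-contained equivalent.
INSERT INTO item (value)
SELECT (random() * 100)::int
FROM generate_series(1, 100000);
```

To reproduce the diagnosis, the auto_explain configuration described above would look roughly like LOAD 'auto_explain' with auto_explain.log_min_duration = 0 and auto_explain.log_buffers = on, plus auto_explain.log_nested_statements = on so the trigger's UPDATE statements get logged.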
Now, to help see what was going on, he also installed the pageinspect extension, and there you can see that HOT updates are being done, but the tuples are just dead. Every update that happens creates a new row version. The solution he took to address this was to not have the trigger run after every row, but after every statement. So it runs the sum function once per statement: when all the data has been inserted for a given statement, it runs the sum function, and the function has been modified to handle that type of update. I think this went from over a minute to 90 milliseconds using this technique. So basically, you want to be cautious whenever you're doing updates in Postgres, because each update creates an extra row version. If you're updating a single row multiple times, you're creating that many versions, so you need to be very cautious about how many times you do that. Here, they cut it down by doing that update once per statement, as opposed to once for each of the 100,000 rows in the statement. But there are different ways you could address it: you could focus on using append-only tables to avoid this situation, or if you're doing a bulk load, incorporate a pause into the loading so autovacuum has time to catch up, or even run a manual vacuum between stages of the data load. That could help as well.

Now, related to this, the next piece of content is "Five minutes of Postgres, episode 40: How to debug slow Postgres updates in bloated tables with auto_explain and pageinspect". This is from pganalyze.com, and the previous post is exactly what Lukas covers in this episode. So if you want to learn more about that, definitely check out his piece of content as well.

[00:05:22] Next piece of content: "For each loops with lateral joins". This is from Sqlfordevs.com, and he's talking about using lateral joins to efficiently pull data that is otherwise hard to get. His example here is that you have customers, and you want to find the last three sales for each of those customers. Now, you can use a subquery to do this, but you basically have to group by the customer ID and pull only the last three, and creating that type of join to the customers table can end up being inefficient, whereas a lateral join is frequently much more efficient and gives you better performance. So any time you want to do a for-each loop over rows within a particular query, definitely look at using a lateral join, as in the sketch below. The post explains how to do it, how it works in Postgres, and even mentions MySQL as well. So definitely check this out if you want to learn more.
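Here's a minimal sketch of the pattern, assuming hypothetical customers and sales tables; it's not the post's exact example:

```sql
-- For each customer row, the lateral subquery runs once to fetch that
-- customer's three most recent sales, like a for-each loop in SQL.
SELECT c.id, c.name, s.amount, s.created_at
FROM customers c
CROSS JOIN LATERAL (
    SELECT amount, created_at
    FROM sales
    WHERE sales.customer_id = c.id
    ORDER BY created_at DESC
    LIMIT 3
) s;
```

An index on sales (customer_id, created_at) is typically what makes this fast, since each per-customer lookup becomes a small index scan; switch to LEFT JOIN LATERAL ... ON true if customers without any sales should still appear.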
Next piece of content: "Migrate PostgreSQL IDs from serial to identity after upgrading to Django 4.1". This is from adamj.eu. Since the introduction of identity columns in PostgreSQL 10, there's a different way to create an auto-incrementing integer. Historically you would use serial, but now you can use the SQL-standard GENERATED BY DEFAULT AS IDENTITY or GENERATED ALWAYS AS IDENTITY, and it gives you a number of conveniences for working with these auto-incrementing integers. One, it's easier to grant permissions to end users. It's part of the SQL standard, so once you learn it, you can use the same syntax in other databases. It's easier to manage the next value, compared with having to call nextval on a sequence. If you're going to copy a table, the sequences are actually copied along with it. And the other benefit is that you can strictly enforce the use of auto-incrementing integers, meaning you can't manually insert an ID if you use GENERATED ALWAYS. So in my opinion, it's really great that an application framework has started using these. I use Rails, and to my knowledge it isn't supporting this yet; it still uses serial. But Django has moved forward in version 4.1 and supports identity for new tables, while serial still works for old tables. He also talks about whether you should migrate your older columns, and he actually developed some code to do that type of conversion for you. So if you use Django and are interested in making this transition, definitely check out this blog post.
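For readers outside Django, here's a rough sketch of what the identity syntax and a manual serial-to-identity conversion can look like. All table and sequence names are hypothetical, and the blog post's own migration code is the better reference for Django users:

```sql
-- New table: the SQL-standard way to get an auto-incrementing id.
CREATE TABLE orders (
    id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    total numeric NOT NULL
);

-- Rough manual conversion of an existing serial column, assuming the
-- default sequence name legacy_orders_id_seq.
BEGIN;
ALTER TABLE legacy_orders ALTER COLUMN id DROP DEFAULT;
DROP SEQUENCE legacy_orders_id_seq;
ALTER TABLE legacy_orders
    ALTER COLUMN id ADD GENERATED BY DEFAULT AS IDENTITY;
-- Start the new identity sequence after the existing ids.
SELECT setval(pg_get_serial_sequence('legacy_orders', 'id'),
              coalesce(max(id), 1))
FROM legacy_orders;
COMMIT;
```

GENERATED ALWAYS is the stricter variant: it rejects manually supplied ids unless you explicitly use OVERRIDING SYSTEM VALUE.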
Next piece of content: "Postgres 15 configuration changes". This is from Rustprooflabs.com. They have a tool they created called pgConfig, and with it they compare postgresql.conf configuration changes between versions. From version 14 to 15, there are six new parameters you can change, three have been updated, potentially with new defaults, and one was actually removed. Among the changes: there's now a log_startup_progress_interval parameter that logs the progress of long-running startup operations, log_checkpoints is now on by default, and log_autovacuum_min_duration is now set to ten minutes. They talk about a few others here, and they link to pgConfig, where you can see all the different changes, what a setting was in 14 and what it is now in 15, and you can compare other versions as well. So if you're looking to migrate from something earlier than 14 to 15, you can do that using this tool too. Definitely check this out if you're interested in that.

Next piece of content: "Exclusive backup mode finally removed in Postgres 15". This is from Enterprisedb.com. Exclusive backup mode was something that was deprecated in 9.6, but now it's fully removed. And to make sure that no one keeps using it unknowingly, they actually changed some of the function names: pg_start_backup and pg_stop_backup are now pg_backup_start and pg_backup_stop. So if you use these lower-level functions to do your backups, you'll need to adopt the new names. For example, I have a client that uses them for file-system-based snapshot backups. They didn't want to use pg_basebackup, which is what I typically use, so we had to use these functions to do a snapshot-based backup. If you want to learn more about these changes, you can check out this blog post.

Next post: "PostgreSQL 15: logging in JSON". This is from fluca1978.github.io, and he's talking about how you can enable the new JSON logging while also logging to standard error, and he shows how the information is output. Then, of course, you can consume it with anything that consumes JSON. So check out this blog post if you're interested in that.

Next piece of content: "Postgres 15 adds COPY header and matching". This is from Enterprisedb.com. They're talking about how, not only for CSV files but also for the standard text format, you can now output a header. They show an example here where you say WITH HEADER, and you should be aware that when you try to load that file back in, you're going to need to specify the header option again or you'll get an error. You can also specify HEADER MATCH, and that will make sure the columns that were output as part of the file match the columns being inserted into. In their example, they renamed one column and then tried to import it with a match, and it gave an error message. So definitely a great addition; check this blog post out if you want to learn more.
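As a quick sketch of the new syntax, reusing the hypothetical item table from the earlier example (the file path is a placeholder):

```sql
-- Postgres 15: HEADER is now allowed for plain text format, not just CSV.
COPY item TO '/tmp/item.txt' WITH (HEADER);

-- Reloading requires the header option again; otherwise the header
-- line is read as data and causes an error.
COPY item FROM '/tmp/item.txt' WITH (HEADER);

-- HEADER MATCH also verifies that the file's column names match the
-- target table's columns, erroring out if they differ.
COPY item FROM '/tmp/item.txt' WITH (HEADER MATCH);
```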
Next piece of content: "Upgrading PostgreSQL 14 to 15 on Fedora, Red Hat Enterprise Linux, CentOS, Rocky Linux and AlmaLinux with PGDG RPM packages". This is from Endpointdev.com. If you run one of those distributions and want to upgrade to Postgres 15, you can definitely check out this blog post for guidance on how to do that using pg_upgrade.

Related to that, the next piece of content is "Using pg_upgrade to upgrade PostgreSQL 9.6 PostGIS 2.4 to PostgreSQL 15 PostGIS 3.3 on Yum". This is from Bostongis.com. It covers not only the Postgres upgrade but also describes how to upgrade PostGIS as well. So definitely check out this blog post if you're having to do that.

Next piece of content: "PostgreSQL at the PASS Data Community Summit 2022". This is from Softwareandbooz.com, and he's mentioning there will be a full-day pre-conference training about PostgreSQL. It looks like it's primarily for people who are coming from Microsoft SQL Server to Postgres, but you can definitely check out the course content here if you're interested in attending.

Next piece of content: there was another episode of Postgres FM this week. This one was on stored procedures and whether you should implement your business logic in the database or on the application side.

[00:12:09] Next piece of content: the PostgreSQL Person of the Week is Cédric Duprez. If you want to learn more about Cédric and his contributions to Postgres, definitely check out this blog post.

[00:12:20] And the last piece of content: we did have another episode of the Rubber Duck Dev Show this past Wednesday evening. This one was on choosing to go off the rails with your application framework, and choosing to stay off them, and that discussion was with Joel Draper. So if you're interested in that kind of long-form developer discussion, we welcome you to check out our show.

[00:12:40] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
Other Episodes

Episode 25

August 13, 2018 00:13:14

generate_series, PG11 Partitioning, GPU Queries | Scaling Postgres 25

In this episode of Scaling Postgres, we review articles covering graphing with generate_series, partitioning in Postgres 11 and GPUs for queries. To get the...


Episode 109

April 13, 2020 00:12:35

Show Plans, WAL Monitoring, Using UUIDs, Default Logging | Scaling Postgres 109

In this episode of Scaling Postgres, we discuss how to show live plans, how Postgres 13 allows WAL monitoring, how to use UUIDs in...


Episode 182

September 12, 2021 00:13:50

Boundless Text, Revoked Permissions, Index Bloat, Hardware Performance | Scaling Postgres 182

In this episode of Scaling Postgres, we discuss having boundless text fields, revoking public schema permissions, less index bloat in PG14 and comparing hardware...
