Postgres In 2024 | Scaling Postgres 297

Episode 297 January 07, 2024 00:20:38

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss hopes for Postgres in 2024, whether you need foreign keys, incremental backups, and five ways of doing NOT EXISTS.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/297-postgres-in-2024/

 


Episode Transcript

[00:00:00] Hi everyone, welcome to 2024. I hope everybody had a great holiday season and you're back and ready to go for 2024. But before we get started, I hope you, your friends, family and coworkers continue to do well. To kick off 2024, I thought I would start with this post from Jonathan Katz, "Thoughts on PostgreSQL in 2024". This is from jkatz05.com, and at the bottom he said he thought this was going to be a quick post, but he ended up spending multiple days on it, and I think it shows. It is a really good post and I would encourage you to read it, because I can't cover everything and he has links to supplemental material you should definitely check out. To get started, he talks about a feature development meeting at PGCon in 2023 where the topic was: what are the biggest challenges for PostgreSQL users? He put those into three buckets. One was availability, another was performance, and another was developer features. A lot of the blog post covers those.

[00:01:06] In terms of availability, what he's talking about is keeping your database available, meaning there's no downtime, potentially due to misconfigurations, that upgrades involve minimal downtime, and that you avoid excessively locking something when you want to make a change; basically, making availability as high as it can possibly be. The first area he talks about is logical replication and how the new features that came out this year support active-active, which is the newer name for multi-master: having two active or primary servers at the same time, so if one server goes down, you still always have an instance to write to. He also talks about blue-green deployments, so you always have an instance to switch to, and logical replication can provide that capability, as well as zero-downtime upgrades. Now, I have not used a lot of logical replication in the past, but I've recently started using it for Postgres upgrades. So for the last few upgrades I've done, I haven't been using pg_upgrade but a logical replication solution, for a couple of reasons. One, it can help you minimize downtime even more than pg_upgrade can. With the --link capability of pg_upgrade you can really cut down your downtime, but the logical replication switchover has been even faster in some of my experience. And two, when you're dealing with collation changes that can occur across operating systems, with pg_upgrade you're stuck with that and have to reindex all your indexes, whereas with a logical replication solution you don't have to worry about it. So that's super great, and I've liked using it so much that I think this could be a path forward for doing any sort of upgrade, not just major versions but even minor version upgrades, to minimize some of that downtime. And could this be the future, rather than streaming from the WAL? In other words, do replicas just become logical replicas as opposed to WAL streaming replicas? I don't know, but I see that as an interesting path forward.
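To make that upgrade workflow a little more concrete, here's a minimal sketch of the publication and subscription setup for a logical replication upgrade; the object names and connection details are hypothetical, and it assumes the schema has already been restored on the new cluster:

```sql
-- On the old cluster: publish the tables you want carried over.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new cluster (schema restored first, e.g. with pg_dump --schema-only):
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-primary dbname=app user=replicator'
  PUBLICATION upgrade_pub;

-- Watch the subscription catch up before cutting the application over.
SELECT subname, received_lsn, latest_end_lsn FROM pg_stat_subscription;
```

At cutover you stop writes on the old primary, carry over the sequence values by hand (logical replication does not transfer them, as discussed next), and point the application at the new cluster.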
Now, he doesn't discuss that particularly here, but he talks all about the enhancements that happened this year in terms of the active-active feature set, and also the things that are waiting in the wings in 2024, which I'm really excited about. One is being able to support sequences; for example, right now sequence values don't get passed over to the replica. That would be awesome to have. Two is the fact that DDL statements aren't transferred from the primary to the replica; having that capability would be a great boon. And the other thing he didn't mention here that would be awesome to incorporate in Postgres is being able to transfer the state of a replication slot from a primary to a replica, so if it gets promoted, that replication slot is already present and at the correct location. If these three areas are addressed, I could see a lot more use cases for logical replication, and they would be great additions. The other thing he talks about is unblocking different locks. There are certain activities, like altering a table, that require an ACCESS EXCLUSIVE lock, so the goal is to minimize the number of times that happens. Is there an option, for example, like the way you can create indexes concurrently? Could you do some of these table changes concurrently as well? That would be awesome (there's a short sketch after this paragraph of what's already possible along those lines). The next bucket he talked about is performance-related changes. The first one is being able to support really large instance sizes: can Postgres scale to use a 448-CPU box with 24 terabytes of RAM, for example? The other area is I/O: being able to support direct I/O, not necessarily relying on the operating system, and also asynchronous I/O. Apparently a lot of that is being worked on, and we'll see what comes to fruition in 2024. [00:05:11] Another huge area he mentions (I don't know how much work is being done on it) is parallel recovery. Right now, if your primary system goes down, needs to restart and starts replaying the WAL, it does that in a single process. Or if you have a replica that's trying to follow a primary and get caught up, it uses a single process to do that. The problem is, I've had some customers where it's a struggle to get that replica caught up because it's a single process, and if you could parallelize that recovery, it would allow replicas to catch up faster and allow a primary to come back up faster as it replays the WAL files. So that's not only a performance reason but also an availability reason; it fits into both buckets. The next bucket is developer features. He praises things like the multirange data type that was recently added, as well as the work that's been done on JSON, but he says Postgres has fallen a little behind the current SQL/JSON specification, so more work should probably go there. Also, adding features that exist in some other databases, like global temporary tables, global partitioned indexes and things of that nature, would help people migrate to Postgres. And of course he follows up talking about AI and machine learning data and how the pgvector extension is great, but should some of those features go into core Postgres? He talks briefly about security, particularly about encryption and transparent data encryption, and how working towards that is important.
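Since exclusive locks came up above, here's a minimal sketch contrasting what you can already do concurrently with an operation that still takes a heavy lock; the table and index names are hypothetical:

```sql
-- Building an index the plain way blocks writes for the duration; the
-- CONCURRENTLY variant avoids that, at the cost of a slower, multi-pass build.
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);

-- Many ALTER TABLE forms still take an ACCESS EXCLUSIVE lock, even if briefly.
-- A lock_timeout keeps the ALTER from queueing behind a long transaction and
-- blocking all other traffic on the table while it waits.
SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN note text;
```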
He also talks a little bit about extensions and how it's great that we have them, but he says he feels the effect of extension sprawl; there are so many extensions out there. I'm not necessarily sure how you address that, but apparently there are multiple extension repositories available, and if those could be coalesced somewhat, that might make the issue easier. And lastly, he talks about community building and making it as accessible as possible for everyone to join the community and make a contribution. So that's a great point of view as well. I definitely, highly encourage you to check out this blog post; I think it's the best one for the new year. [00:07:40] Next piece of content: "Do you really need foreign keys?" This is from shayon.dev, and I'll go out on a limb and say you don't need them until you need them. What I mean by that is, starting out, maybe they're not as important if you know your code and all the data changes that are happening to the database, and you follow particular guidelines. But once you start introducing more people, more programmers, to the mix, that's when constraints like foreign keys become incredibly important. Maybe you're using an ORM, but that ORM can be bypassed by most programmers, which can defeat the constraint capabilities of some application frameworks, and then the only thing protecting you from some issues you may encounter is a foreign key constraint. So that's why I say you probably don't need them until you need them. But his first reason not to use them religiously is the performance impact during writes and deletes. This is definitely the case whenever you place some sort of constraint on the system: it must be checked before whatever action you're doing, whether you're inserting, updating or deleting data. Now, he doesn't have an insert example here, and I wonder what the performance difference of that would be, but he shows a delete example for a particular schema he set up, and it took half a second to delete a set of records with foreign key constraints. When he removed those foreign key constraints and did the same delete, it took a tenth of a second, so it was five times faster without the foreign key constraints when doing delete operations. So there definitely is a performance cost. The next reason he gives to possibly not use them is unnecessary lock contention. Yes, checking those constraints can lead to some lock contention, so that's definitely something to be aware of in your system. Next is complexity in database migrations and scaling. I don't know so much about scaling, but you definitely have to do database migrations more carefully when there are foreign key constraints, because you have to do things in a certain order to make sure that everything's accurate. Next is unwarranted indexes. Generally, when you put on a foreign key constraint, you're going to want an index added as well. So those are additional indexes that maybe you wouldn't ordinarily need, but they're there for the off chance you have to validate that constraint when you're doing some sort of operation, and you might not otherwise need them for the queries against that particular table. So it's just an additional tax on modifications to that table. [00:10:21] And basically he says, practice reflection. Quote: "as you start on designing a new project, I think it's perfectly fine to initially incorporate foreign keys, but then revisit this periodically to make sure that it makes the most sense." I think I'd probably recommend something similar at this time. Go ahead and implement them while the project's small, but know that they could be a burden for your performance, and as you start scaling, make the determination: should we drop the constraints on these particular tables to get the most performance, or is the data integrity more important? You can make those decisions at that time. But I thought this was a thought-provoking blog post. Definitely encourage you to check it out.
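To make the delete overhead he measured a bit more concrete, here's a rough sketch of that kind of parent and child schema; the table and constraint names here are hypothetical rather than taken from his post:

```sql
-- A hypothetical parent/child pair with a foreign key.
CREATE TABLE users    (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY);
CREATE TABLE sessions (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
                       user_id bigint NOT NULL REFERENCES users (id));

-- Every DELETE on users now has to check sessions for referencing rows, so
-- index the referencing column (Postgres does not create this index for you).
CREATE INDEX ON sessions (user_id);

-- If you later decide the write overhead isn't worth it, dropping the
-- constraint is a one-liner; sessions_user_id_fkey is the default-generated name.
ALTER TABLE sessions DROP CONSTRAINT sessions_user_id_fkey;
```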
[00:11:05] Next piece of content: "PostgreSQL 17: support for incremental backups". This is from dbi-services.com, and this is awesome, the fact that we may get incremental backups in Postgres 17. Right now, when you're doing a backup, you would use pg_basebackup, which copies all the data files and some of the WAL files, and then you need to retain all of the WAL files up to a particular point in time. When you need to do a restore, you restore the backup you took plus all the WAL files to bring the database back up to that point in time. So depending on how often you take that full backup, it could be one day's worth of WAL files, it could be three days, it could be seven days. But if your system goes down and you need to do a restore, you're going to be applying all of those WAL files that have been generated. An incremental backup is a different way to do it: you take the full backup, and then you only back up what has changed since the last backup. So maybe you take a full backup once a week, incrementals every day, and you still have the WAL files, so you can restore to a particular point in time in presumably a much faster way. Or maybe you still do a full backup every day, but every couple of hours or every four hours you do an incremental backup, with just the WAL files from there; that should let you restore even faster. He shows an example of that here: he creates a database, inserts some data into it, and then does a pg_basebackup to a particular directory. He adds some additional data and then takes an incremental backup with pg_basebackup, using the incremental flag to point at the backup manifest created by the initial backup, and he defines the new directory where this incremental backup should go; basically, a backups_incr1 directory. There is something that needs to be enabled in order to make this happen, and that is WAL summarization, which helps the system determine what blocks have changed since the last backup, so you need to enable that before you take the first full backup. He went ahead and did that, did the backups again, and then took a second incremental backup. Then, to do the restore, you actually use a new tool called pg_combinebackup: you give it the main backup directory and all the incrementals you want, and you output it to a particular destination directory, in this case a restore directory under /var/tmp. Once this has been restored, you can simply navigate to that directory and start PostgreSQL, and it works. So like I said, this should make for much faster restore times using this incremental backup approach. So this is great.
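The commands in that walkthrough look roughly like the following sketch, assuming Postgres 17 and hypothetical backup paths:

```bash
# Enable WAL summarization before the full backup the incrementals will build on.
psql -c "ALTER SYSTEM SET summarize_wal = on;"
psql -c "SELECT pg_reload_conf();"   # restart instead if your setup requires it

# Full base backup.
pg_basebackup -D /backups/full

# Later: an incremental backup, referencing the manifest of the previous backup.
pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest

# To restore, combine the chain into one data directory and start Postgres from it.
pg_combinebackup -o /var/tmp/restore /backups/full /backups/incr1
```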
Related to this, the topic was also covered in Five minutes of Postgres by pganalyze.com: "Waiting for Postgres 17: incremental base backups". Lukas doesn't talk about this particular post, but he talks about the patch that went in and shows examples of running it in his video. So if you want to learn more about this, definitely check out his piece of content as well. [00:14:23] Next piece of content: "Five ways to implement NOT EXISTS in PostgreSQL". This is from eversql.com, and he proposes an example where you have a set of users, the users have sessions against the database or the application, and you want to know which users have not had any sessions; so a set of users, quote, "not having any activity". He proposes five different ways to do this. One is using NOT EXISTS. Another is using NOT IN, so not in the set of IDs. Another is comparing with <> ALL. Another is doing a left outer join from users to sessions and then checking for a NULL value in, say, the primary key of the session to determine which users haven't had sessions. And the last is using the EXCEPT keyword. He goes through and shows the syntax for each of these; I won't go into detail on that, you can definitely check out the blog post. He also points out cases where one technique is slow and another is fast, but I like the performance results at the bottom. Each column is a different case: the first is 10,000 users and 10,000 sessions; the next is a million users and 10,000 sessions; the next is 10,000 users and a million sessions; and the final column is a million users and a million sessions. Looking across these, clearly don't use <> ALL, and you probably don't want to use NOT IN, although with small data sets it's not terrible. The ones that seem to have the most consistent performance are NOT EXISTS and the left join method. I have tended to use the left join method and haven't used NOT EXISTS too much. What's interesting is that NOT EXISTS actually gives the lowest performance at the largest sizes, although it's close, and I don't know what the statistical significance of the difference is, but these are definitely the two I would go to if I needed to do these types of queries. But check out this blog post if you want to learn more.
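For reference, the two consistently fast approaches look something like this, with column names assumed rather than taken from the article:

```sql
-- Users with no sessions, using NOT EXISTS:
SELECT u.id
FROM users u
WHERE NOT EXISTS (SELECT 1 FROM sessions s WHERE s.user_id = u.id);

-- The same anti-join written as LEFT JOIN ... IS NULL:
SELECT u.id
FROM users u
LEFT JOIN sessions s ON s.user_id = u.id
WHERE s.id IS NULL;
```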
Next piece of content: "Praise, criticism, and dialogue". This is from rhaas.blogspot.com, and this was an interesting post. It's talking about the Postgres community and how he feels a lot of the mailing list culture, quote, "feels bruising", meaning there's a lot of criticism but not as much praise, and he feels we should try to focus more on the praise. You still need criticism to build an excellent product, but it should be balanced with sufficient praise as well. And I can definitely agree; I find myself being too critical of myself, of others, or of other situations, so having that balance is definitely important. What I found interesting is all of the comments talking about the importance of this message. So definitely a good post to talk about and think about. [00:17:28] Next, there were more episodes of Postgres FM over the past two weeks. The first one was on partitioning by ULID, which I think of as a precursor to a time-based UUID. A lot of what they were talking about is using UUID version 7 IDs, which are time-based at the beginning and random at the end, and how those will be really valuable in the future. People may tend to use those as opposed to bigserial columns or bigints with a sequence attached, because they don't have the problem of fully random UUIDs that get written all over the index and cause performance problems for Postgres, but they give you relatively incremental IDs that you can use for primary keys. So this is actually something I hope will come out in 2024 as well: version 7 UUIDs for Postgres. They said it looks like it's been delayed until the standards groups actually finalize UUID version 7, but we'll just have to see what happens. Check out their piece of content if you want to learn more. The next episode from them was on work_mem. This is a Nikolay solo episode where he talks about using work_mem and how he likes to configure it, and I really like his opinion here; it's kind of what I do as well. You basically set a particular value to start, make sure it's not so high that you're going to run into an out-of-memory situation, and then you monitor what's going on: things like checking the logs to see if you're writing temporary files, which maybe means your work_mem needs to be a little higher for particular queries, and maybe you increase work_mem for the whole system. Another path you can take is to raise it only for particular users. Maybe a reporting user really needs to run some complex queries, so that user gets a higher work_mem. Or maybe you know a particular query is going to use a lot of work_mem, so for the particular session or transaction where that report needs to run, you give it a higher work_mem. But it's definitely important to log and monitor what's going on, to minimize issues of your work_mem being too low and causing queries to write to disk, because that just diminishes your performance. Definitely another great episode I encourage you to check out. And the last piece of content is "Demand the impossible: rigorous database benchmarking". This is from erthalion.info, and this is more of a thought piece on the challenges of benchmarking, all the variables you need to take into account when doing it, and how hard it really is, along with some examples he's encountered. So if you want to learn about the challenges of benchmarking, definitely encourage you to check out this blog post. [00:20:18] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content discussed, the podcast audio, as well as a full transcript. Thanks, and I will see you next week.

Other Episodes

Episode 64

May 20, 2019 00:13:12

Leveraging Indexes, Slugs, Addresses, Security Definer | Scaling Postgres 64

In this episode of Scaling Postgres, we discuss leveraging indexes, a slug function, addresses and security definer in functions. To get the show notes...

Episode 97

January 20, 2020 00:13:11

Tips & Tricks, Faster APIs, Full Text Search, pgbouncer_fdw | Scaling Postgres 97

In this episode of Scaling Postgres, we discuss tips & tricks, how to get faster APIs, full text search considerations and pgbouncer_fdw. To get...

Episode 46

January 13, 2019 00:11:41

Stuck Transaction IDs, pgBouncer Auth, UUIDs, Monitoring | Scaling Postgres 46

In this episode of Scaling Postgres, we review articles covering stuck transaction ids, pgBouncer authentication, sequential UUIDs, and monitoring. To get the show notes...
